What with Current NOLA 2025 happening this week, and some very last minute preparations for the demo at the keynote on day 2, this month’s links roundup is pushing it right up to the wire :) The demo was pretty cool, and finally I have a good example of how this AI stuff actually fits into a workflow ;) I’ll write it up as a blog post (or two, probably)—stay tuned!
Some self-promotion to begin with:
- 
This month a couple of colleagues and I launched Flink Watermarks…WTF. It’s an interactive explainer about watermarks in Apache Flink. Try it out and let me know what you think. Oh, and I even designed some stickers for it!
- 
I gave a talk about Blog Writing for Developers - check out the link for slides and audio recording 
- 
I was a guest on the Confluent Developer podcast - 🎥 video here, 🎧 audio here 
With that, on with the interesting links!
Not got time for all this? I’ve marked 🔥 for my top reads of the month :)
Kafka and Event Streaming 🔗
- 
Probably the biggest discussion in the Apache Kafka community at the moment is the direction of the project with regards to "Diskless" (or "Direct-to-S3"). Here’s a round-up of some of the key reading:
- 
🔥 Summary from Luke Chen of the different proposals, and more recently analysis and commentary from Jack Vanlightly. 
- 
Discussion of the KIP-1150 proposal on the Apache Kafka mailing list 
- 
More analysis and commentary from Fresha’s Anton Borisov. 
- 
Updated recently, Hans-Peter Grahsl and Gunnar Morling’s A Great Day Out With… Apache Kafka is a useful map of the tools and ecosystem. 
- 
Vu Trinh does a deep-dive on how the Kafka-compatible Bufstream handles data validation, comparing it to the Kafka + Schema Registry approach 
- 
Interesting analysis from Ivan Juren on the linger.ms setting in Kafka and the throughput/latency/CPU trade-off.
- 
🔥 Federico Valeri has published a very well-written deep dive into Kafka’s KRaft protocol. 
- 
A nice TUI for Kafka from Dustin Dobervich. 
- 
The Queues feature for Kafka was added recently - this demo from Italo Nesi is a neat way to explore it. 
- 
Klaviyo’s Chinmay Sawaji has written a good post explaining how they build their Kafka producers to be resilient to failures. 
- 
In a fantastic example of both "just because I can" and "I’m going to explain this thing using a cool example", Leandro Proença shows how to rebuild Kafka using UNIX signals. 
- 
In a somewhat more serious approach (I think?) Stanislav Kozlovski makes the case for using Postgres instead of Kafka in many situations. Oliver Russell wrote last year about how his team actually do use Postgres as a queue. 
- 
Backfilling data in Kafka is a "day 2" type problem, but definitely a real one, and Nejc Korasa has a nice write-up of some of the patterns to consider. 
- 
A tool from Jeffrey Jonathan Jennings to analyse key distribution and help avoid hot partitions. 
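
To make the hot-partition idea concrete, here is a minimal sketch in plain Python. It is not the linked tool and makes no claims about how that tool works. The hash is a stand-in (CRC32) chosen only to keep the example dependency-free, whereas Kafka's default partitioner uses murmur2. The key names, the function names, and the four-partition topic are all hypothetical.

```python
import zlib
from collections import Counter


def partition_for(key: str, num_partitions: int) -> int:
    # Stand-in hash (assumption): Kafka's default partitioner uses murmur2, not CRC32.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_load(keys, num_partitions):
    """Count how many records would land on each partition."""
    counts = Counter(partition_for(k, num_partitions) for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


def skew_ratio(load):
    """Busiest partition relative to the mean; 1.0 means perfectly even."""
    mean = sum(load) / len(load)
    return max(load) / mean if mean else 0.0


# Hypothetical sample: one customer dominates the traffic.
keys = ["customer-1"] * 80 + [f"customer-{i}" for i in range(2, 22)]
load = partition_load(keys, num_partitions=4)
print(load, round(skew_ratio(load), 2))
```

Because every record with the same key maps to the same partition, the 80 `customer-1` records all pile onto one partition, and the skew ratio comes out well above 1.0.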
Stream Processing 🔗
- 
Very cool blog post from the team at Grab on using machine learning to predict workloads and scale Flink automagically. 
- 
Milind Srivastava and colleagues at CMU have published a library of sketching algorithms for Flink’s DataStream API. 
- 
🔥 Yennick Trevels has published both a Kafka Streams monitoring guide and an excellent Grafana dashboard for Kafka Streams. 
- 
Flink’s Hadoop-rooted support for S3 has caused plenty of travails for lots of people, including me—and the community has recognised this with a discussion beginning about creating native support for S3 within Flink. 
- 
A report from Flink Forward 2025 by Yaroslav Tkachenko. 
- 
Hands-on example from Gal Krispel at Riskified on how they use Flink’s DataStream API to validate and pre-process data to make their Flink SQL pipelines more resilient. 
- 
Netflix’s Adrian Taruc and James Dalton describe how they’ve used Kafka, Flink, and Iceberg to build a real-time distributed graph. There’s some good detail in there about the processing that Flink does, and their experiences in scaling it. 
- 
Reddit’s Vignesh Raja and Jerry Chu write about their experience with Flink’s tumbling window joins and their own custom join implementation. 
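
For anyone new to the term, a tumbling window join pairs events from two streams that share a key and fall into the same fixed-size, non-overlapping time window. The sketch below is plain Python rather than Flink code, and it says nothing about Reddit's own implementation. The event data and function names are hypothetical.

```python
from collections import defaultdict


def window_start(ts_ms: int, size_ms: int) -> int:
    # Align a timestamp to the start of its fixed-size, non-overlapping window.
    return ts_ms - (ts_ms % size_ms)


def tumbling_window_join(left, right, size_ms):
    """Join (key, ts_ms, value) events that share both a key and a window."""
    buckets = defaultdict(list)
    for key, ts, value in right:
        buckets[(key, window_start(ts, size_ms))].append(value)

    joined = []
    for key, ts, value in left:
        w = window_start(ts, size_ms)
        for other in buckets.get((key, w), []):
            joined.append((key, w, value, other))
    return joined


# Hypothetical events: (key, timestamp in ms, payload)
impressions = [("ad-1", 1_000, "imp-a"), ("ad-1", 9_500, "imp-b")]
clicks = [("ad-1", 4_000, "click-a"), ("ad-1", 10_200, "click-b")]
print(tumbling_window_join(impressions, clicks, size_ms=5_000))
```

With 5-second windows only `imp-a` and `click-a` are joined. `imp-b` and `click-b` are just 700 ms apart but sit either side of a window boundary, so they never meet, which is the kind of edge a tumbling window join has by design.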
Streaming Analytics 🔗
- 
🔥 Excellent article (and accompanying code repo) from Guillermo Sanchez showing how low-latency analytics on data from Kafka can be done in DuckDB. Definitely adding this to my list to try out and write about myself :) 
- 
In a similar vein, Yuxia Luo has published a DuckDB extension to directly query Apache Fluss. 
Analytics 🔗
- 
Aakash Pradeep and his colleagues at Twilio built Odin, which is a multi-engine query platform enabling them to offer Amazon Athena alongside the existing Presto. 
- 
Details of Chinese ride-sharing company DiDi’s evaluation of StarRocks against ClickHouse. Also from StarRocks is a look at VBill’s migration of a real-time data pipeline from a Kudu/HBase/Hive architecture to StarRocks and some of the optimisations implemented. 
- 
Ankit Sultana and his colleagues at Uber write about their migration from a Presto-based proxy in front of Pinot toward a Pinot-native architecture including Pinot’s Multi-Stage Engine Lite Mode to serve real-time analytics workloads. 
Data Platforms, Architectures, and Modelling 🔗
- 
🔥 Practical advice from Joe Reis on data modeling—specifically, how to get buy-in from your company to actually do it properly. 
- 
An updated version of a16z’s 2020 post looking at Emerging Architectures for Modern Data Infrastructure. 
- 
Getting everyone (in the small world that is data engineering) all excited, Fivetran and dbt merged recently. Michael Driscoll has a measured analysis of it over on LinkedIn. 
- 
Taking a broader look at what’s become of the Modern Data Stack is this excellent article from Travis Thompson and Animesh Kumar. Insightful and detailed analysis with plenty of evidence to back up their hypotheses. 
- 
My Confluent colleague Alex Stuart wrote a good post about Building a Better-Governed Data Lake Architecture. 
- 
An interesting architecture idea from Ananth Packkildurai: Data Vault in Silver, Dimensional Modeling in Gold. 
- 
Where should data contracts go? Mark Freeman and Chad Sanderson tell us. 
Data Engineering, Pipelines, and CDC 🔗
- 
Debezium 3.4.0.Alpha1 has been released, which includes support for Postgres 18, OpenLineage output from Debezium Server, improvements to the Oracle LogMiner support, and more. 
- 
What’s the best way to add a new table in Debezium? Fiore Mario Vitale explains it here, including things to watch out for. 
- 
I enjoyed reading this one, as my assumption about partitioning is exactly what Kirill Bobrov says here is not the way to do it (and explains an alternative approach instead). 
- 
🔥 It can’t really be a month of interesting links without at least one from Jack Vanlightly, and this month we have three :) This post is a well-reasoned argument as to why he is not a fan of zero-copy for getting data from Kafka to Iceberg. 
- 
A two-part series from Kakao describing their implementation and troubleshooting of a CDC pipeline with Kafka Connect from Postgres to Elasticsearch. It’s in Korean but if you open it in Chrome etc the in-browser translation tool will work wonders :) 
- 
A decent comparison of the open-source data ingestion frameworks (Flink/Kafka Connect/Spark) from Shiyan Xu at Onehouse. If you notice a recurring theme of Spark cost and performance optimisation then I’m sure it’s not because Onehouse have their own tool to fix that ;) 
- 
A summary from ByteByteGo on how Pinterest use CDC. 
- 
Fresha have burst onto the data engineering blogging scene in recent months, sharing all sorts of excellent details about their platforms. This post from Emiliano Mancuso explains why they moved from JSON to Avro in their CDC pipelines to Snowflake. 
Open Table Formats (OTF), Catalogs, Lakehouses etc. 🔗
- 
Jack’s back! With a hat-trick of entries in this month’s post, here he’s looking at How Open Table Formats Optimize Query Performance. 
- 
Anton Borisov takes a look at the proposal for the next version of the Iceberg spec and how it could improve things when working with CDC data. 
- 
Vincent Daniel at Expedia writes about Why You Should Prefer MERGE INTO Over INSERT OVERWRITE in Iceberg.
- 
Iceberg catalog Apache Polaris has released v1.2, and Alex Merced has written an article about what’s new. Meanwhile, Apache Gravitino (with bigger ambitions beyond just an Iceberg catalog) has released v1.0. 
- 
🔥 Dipankar Mazumdar has a good article comparing Apache Parquet with newer file formats such as Lance and Vortex. If new formats are your thing, a recent SIGMOD paper announced the open-source F3 (Future-proof File Format). Also doing the rounds this month was news of IndexTables, which describes itself as "an experimental open-table format for Apache Spark that enables fast retrieval and full-text search across large-scale data", whilst Project Amudai is an "advanced columnar storage format […designed to] address the limitations of existing data lake formats, such as Apache Parquet". 
- 
Petrica Leuca has an interesting post about time travel and versioning in DuckLake. I’m even more of a fan because it starts from the point of investigating SCD type 2—what’s not to like! 
- 
As well as writing from Kafka to Iceberg, Confluent’s TableFlow now supports writing to Delta Lake, upserts, and dead-letter queues. 
- 
Kinda like benchmarks, feature comparisons published by vendors are inherently biased—whether consciously or not. Kyle Weller at Onehouse—who contribute to the Apache Hudi format—has published an updated feature comparison of Iceberg, Hudi, and Delta Lake. You can guess which one comes out on top ;) Snark aside, it’s still a useful article if only to look at the positioning and strengths of Hudi. 
- 
Videos from the recent Greater Seattle and San Francisco Iceberg meetups have been added to their respective playlists. 
- 
Shuiqiang Chen describes how TikTok uses Apache Paimon in their recommendation systems. 
RDBMS 🔗
- 
A nice concise list from Jordan Goodman of SQL Anti-Patterns You Should Avoid. 
- 
What happens when you run DuckDB with a 10TB dataset on a 64 core/512GB machine? Mimoune Djouallah found out. 
- 
Alexey Makhotkin has some excellent content on his blog, including this one looking at the systematic design of multi-join GROUP BY queries.
- 
🔥 Having recently helped build flink-watermarks.wtf I now pay much more attention to examples of scrollytelling, and this one from Nanda Syahrasyad showing how to Build Your Own Database is really good!
- 
Postgres 18 was released recently, and Ben Dicken did some benchmarking comparing it to Postgres 17 
General Data Stuff 🔗
- 
🔥 Datadog process over 100 trillion events per day, and wrote their own event store called Husky to handle it. They’ve written previously in depth about how it handles exactly-once ingestion and compaction, and in their most recent post Sami Tabet explains how they built its interactive querying capabilities. 
- 
Otter/CloudKitchens found both Stackdriver and OpenSearch too expensive for their logging needs—so they wrote their own (in Rust, of course). They claim some impressive numbers—"750+ TiB of logs at 4.4x lower cost than self-hosted OpenSearch[…]50x cheaper than managed alternatives". 
- 
I do like a property graph, and am interested to look more into Apache GraphAr (incubating) which Sem Sinchenko describes in this article as a standard for Property Graph storage. In other graph news, DuckDB has a graph community extension that Daniël ten Wolde shows in action here. 
- 
Arc [not the web-browser] is a time-series database built on DuckDB, Parquet, and Arrow, and claims ingestion rates of 2.4M records/sec. 
- 
Described as an "open-source immutable SQL database with comprehensive time-travel", XTDB released v2 earlier this year. 
- 
Robert Yokota writes about the Robustness Principle (a.k.a. Postel’s Law) in the context of JSON Schema compatibility. 
- 
OpenAI’s Bohan Zhang spoke at PGConf this year about their use of Postgres and experience scaling it. For more details of OpenAI’s data platforms check out this blog post summarising how they deploy Kafka and Flink on Kubernetes. 
- 
It’s more about video streams than event streams, but this three part series from Netflix is a fascinating behind-the-scenes explainer of how things work. 
AI 🔗
I warned you last month…this AI stuff is here to stay, and it’d be short-sighted to think otherwise. As I read and learn more about it, I’m going to share interesting links (the clue is in the blog post title) that I find—whilst trying to avoid the breathless hype and slop.
- 
I wrote a post trying to get my head around what we mean by Agents. 
- 
Basic Memory is a very cool MCP server that integrates with your AI tool and acts as a memory of your conversations, storing the information locally in Markdown. It integrates very neatly with Obsidian. I’m a big fan. 
- 
Confluent announced a bunch of neat stuff at Current this week including a real time context engine and streaming agents. Product blog posts are m’kay I guess but I always like to see the hands-on detail, and so I enjoyed reading my colleague Yash Anand’s example of building with streaming agents. 
- 
🔥 Very cool talk (video / slides) from Ty Smith and Adam Huda with real-world examples of how Uber’s developers are using AI and what benefits they’re seeing. 
- 
Apache Flink Agents is a sub-project of Apache Flink, and they just had their first release. 
- 
Claude Skills are the latest hawtness (at least until the next thing comes along tomorrow), and Gordon Murray has published a set of them with support for technologies including Flink, Fluss, and Iceberg. 
- 
As well as changing how we get things done, AI is probably going to change how we build platforms too. Ananth Packkildurai has a good analysis of two papers looking at how Agents use data and how systems might be better designed for that, and Ciro Greco looks at how Agents involved in carrying out data engineering tasks might drive platform requirements. 
And finally… 🔗
Nothing to do with data, but stuff that I’ve found interesting or has made me smile.
Think 🔗
- 
A Simple Formula for Responding not Reacting - Brad Stulberg 
- 
🔥 A cartoonist’s review of AI art - The Oatmeal 
- 
Michael Lopp (a.k.a Rands) has an excellent two part series: So You Want to Be Promoted. 
- 
🔥 Stop Avoiding Politics is a great blog post by Matheus Lima. I wish I could go back several years and show it to younger-me ;) 
Tool 🔗
- 
I used freedium.cfd in previous editions of this series, and unfortunately it’s gone offline. scribe.rip is similar in concept: read Medium articles, without having to go to Medium.com (because, paywall, etc). I’m not going to use it on the links in this blog post (like I did with freedium.cfd) because everything breaks if/when it goes offline.
- 
time.is is a very useful site that displays the current time for any timezone. It’s got a lovely clean interface, and a neat UX where you can just append the timezone to the URL: https://time.is/gmt, https://time.is/pt, etc.
Watch 🔗
Nerd 🔗
- 
An interactive simulation of a Citizen Quartz Multi Alarm III watch, by Andy Jakubowski 
- 
Nothing motivates a nerd more than a perceived wrong, and this is a fantastic example of the lengths folk will go to :) How I Reversed Amazon’s Kindle Web Obfuscation Because Their App Sucked. 
- 
🔥 Don’t stop to ask WHY, just click on the link and admire the goodness that is a Shader…written in SQL 
