This is the twelfth edition of this newsletter in its current form. It’s great to see the audience for it growing, and the reception consistently positive when I share it. Nice words always inspire me to carry on with it :D The Substack edition (exactly the same content, but sent out by email) is also picking up views and subscribers.
A couple of blog posts from me since the last edition of Interesting Links—both outside the usual Kafka/Flink scope:
I’ve also been firmly on-board the Claude Code/Opus 4.5 bandwagon, giggling like a child at the sheer magnitude of what it can now do. I’m going to write a blog post in its own right shortly in more detail, but if you want to marvel at what AI can do: I migrated my previous talks site (and the one before that) to this brand new one. Without writing a single line of code. Not one. Not a single byte.
So anyway. More AI blog posts to come, but for now—on to the Interesting Links!
Looking back (Reviews of 2025)… 🔗
See below for AI-specific links about 2025.
-
Data
-
RisingWave and Aiven both did round-table 2025 roundups with panels of people from a subset of the Kafka community.
-
General tech & industry
…and looking forward (predictions for 2026) 🔗
See below for AI-specific links about 2026.
-
Oxide & Friends (Bryan Cantrill & Adam Leventhal, plus Simon Willison, Steve Klabnik, and Ian Grunert).
-
Ben Lorica - Data Engineering in 2026: What Changes?
-
Paul Dix - 2026: The Great Engineering Divergence.
-
🔥 Joe Reis - 2026 - General Thoughts on What’s Ahead.
-
Ian Cook (Columnar) - 10 Predictions for Data Infrastructure in 2026.
-
Simon Späti - Data Engineering: Trends and Predictions.
-
Darren Wood - Data predictions for 2026.
Kafka and Event Streaming 🔗
-
🔥 Another good blog post from Stefan Kecskes, looking at Dead Letter Queue (DLQ) handling in Kafka. For a hit of nostalgia, here’s a blog post that I wrote in 2019 also looking at DLQs in Kafka Connect.
-
How many TUIs for Kafka is too many? Well, we’re not there yet, and Hoa Nguyen brings us LazyKafka. Out of curiosity I did a quick Google and found seven TUIs in total, including this one: kafka2i / kaftui / ktea / yozefu / kaskade / ktui
-
Into the YAKR (Yet Another Kafka Replacement) category comes KafScale, an Apache 2.0 licensed Kafka-on-S3 broker written by Alexander Alten. It has support for Iceberg and SQL.
-
Good write-up from Sky Kistler at Reddit on how they migrated their 500+ EC2-based Kafka brokers to k8s-hosted Strimzi. The numbers are impressive: the brokers serve tens of millions of messages per second and store over a petabyte of live topic data.
-
🔥 Gwen Shapira’s 2017 QCon talk Streaming Microservices: Contracts & Compatibility is one that I keep coming back to over the years. Loosely-coupled services need contracts; just because you’re using Kafka and not REST, it doesn’t mean you escape that truth.
-
Tansu is a replacement Kafka broker written in Rust, with the interesting twist that it uses Postgres, S3, or SQLite for its storage. There have been some interesting recent blog posts, including two on internals performance tuning (1 2), as well as deploying it on a t3.micro instance on AWS. And, if you enjoy lengthy conversations, you’ll want to check out this 3.5hr interview between Stanislav Kozlovski (a.k.a. "2 Minute Streaming") and the author of Tansu, Peter Morgan.
-
A good roundup from Kafka PMC Chair Mickael Maison looking at some of the milestones in the Kafka project in 2025.
Analytics 🔗
-
🔥 An excellent real-world troubleshooting blog post from Maryna Kryvko and Ivan Potapov at Zalando, describing Elasticsearch performance issues and how they resolved them.
-
Sirius is a "GPU-native SQL engine" that plugs in to DuckDB, claiming to be an impressive "10× faster than DuckDB and 60× faster than ClickHouse at the same hardware rental cost".
-
A paper from Snowflake and researchers at University of Technology Nuremberg looks at some of the optimisations possible in Snowflake: Pruning in Snowflake: Working Smarter, Not Harder.
-
A look under the covers at how ClickHouse handles strings, by Artem Golubin.
-
Vasudev Maduri demonstrates the performance improvements in BigQuery from history-based optimisation.
Stream Processing 🔗
-
A deep-dive (of course) from regular Anton Borisov, looking at Flink 2.2’s improvements to Delta Join, including support for CDC Upserts.
-
Riskified’s Gal Krispel has an interesting talk on YouTube about using Flink SQL and DataStream together to overcome some issues they found when using SQL alone. The talk is in Hebrew but the auto-translation of the captions by YouTube is good enough to follow along.
-
Somewhat of an interloper to this blog, but Apache Spark has just released version 4.1. This blog post covers some of the new features, including Spark Declarative Pipelines (SDP), as well as lower-latency streaming capabilities with "Real-Time Mode" (RTM) implemented with SPARK-53736.
-
🔥 Jonas Geiregat has a useful discussion of managing Kafka Streams and its memory usage.
Data Platforms, Architectures, and Modelling 🔗
-
🔥 Jesus Gomez at Fresha has a good blog post looking at some of the required changes to their data modelling approach when migrating from Snowflake to StarRocks.
-
Dejan Menges has written up a two-part series about Vinted’s event-driven platform. It’s fairly high-level, and I’m hoping there’ll be a part 3 (if not more) that takes a deeper dive into some of the specifics.
-
Saubhagya Awaneesh and colleagues at Grab have published details of their real-time customer data platform built on technologies including StarRocks, Flink, and Kafka.
-
🔥 Mark Rittman has an insightful post looking at a fifty-year cycle of tools promising to democratise data work, each delivering genuine value while leaving the fundamental need for specialists stubbornly intact.
-
We’ve been trying to prise Excel from our users' hands for decades with no success, regardless of the shininess of the replacement. Jelle De Vleminck lays out an argument for why this is so, and why it’s perhaps a misguided goal.
-
Joe Reis on Data Identity Politics and The Kimball vs. Inmon War (Bill Inmon recently re-published some of his material, always worth reading).
-
The hype around Data Mesh may have subsided, but it’s still an interesting concept. Sebastian Werner and his colleagues at ThoughtWorks have taken a look at where data mesh is at in 2026.
Data Engineering, Pipelines, and CDC 🔗
-
A breath of fresh air from Julian Hurault in his Boring Engineer Manifesto.
-
James Carr has a good explanation of the Transactional Outbox Pattern.
-
Good stuff from Tim Castillo, discussing how he structures his data pipelines, and in a separate post going into depth on the bronze layer.
-
Details from Yupeng Fu and colleagues at Uber of how they do pull-based ingestion from Kafka into OpenSearch at scale.
-
Data quality may not be the most exciting of subjects, but without trust in data, an organisation’s data strategy can soon unravel. Richard Glew goes through some useful principles and ideas for data testing in this post.
-
I don’t care if I’m featuring him three times in one edition; Joe Reis writes really good stuff. This post looks at the trope that "data modelling is dead". Except as we all know, it really oughtn’t be.
-
They may seem obvious enough (or so the Reddit crowd thought 🙄), but the very fact that this blog post about cost efficiency techniques in data engineering even needs writing suggests that not everyone is aware of them.
-
Debezium 3.4.0.Final has been released, including a bunch of new features along with support for Kafka 4.1.1 and Postgres 18.
-
An interesting set of posts from Ngoc Doan in which they describe their ETL pipeline built with Go and Postgres.
-
🔥 Chris Gambill has some tough words, but ones that I agree with: The Mid-Skill Data Engineer is In Trouble.
-
Data contracts are not just a good idea in theory; they’re also used in the real world. Andrew Jones has a list of some implementations with details of each.
-
Some useful details of the operational discipline that Agoda have adopted for running their financial data pipelines.
-
Oscar Ligthart and Rodrigo Loredo from Vinted describe how they manage decentralized, domain‑owned data pipelines with an Airflow-based orchestration system.
-
A great hands-on post from Conor Gallagher at Zalando on contributing a fix to Debezium for an issue they were having with WAL growth and logical replication.
-
🔥 An excellent post from Simon Späti: A Diary of a Data Engineer. Fantastic background and narration, and a crucial section: What I Know Now That I Wish I Knew Then.
Open Table Formats (OTF), Catalogs, Lakehouses etc. 🔗
-
As fun as it is importing a ton of Hadoop dependencies every time we want to use Parquet /s there might be a better alternative—and my colleague Gunnar Morling is building a proof of concept called Hardwood as a minimal dependency implementation of Parquet. (I also like the Parquet / Hardwood wordplay ;) )
-
Catalogs are, for me, one of the most confusing aspects of the data platform ecosystem. The term is so overloaded, and numerous products in the space overlap in functionality too. Hari Thatavarthy does a good job of explaining the evolving role of the data catalog. For a Flink-specific spin, check out my primer on catalogs in Flink SQL.
Iceberg 🔗
-
Ananth Packkildurai has written a blog post detailing Zeta’s AWS S3Tables/Iceberg-based lakehouse, with further details about the ingestion process here.
-
Neelesh Salian has built a tool called floe to do policy-based table maintenance for Iceberg tables.
-
🔥 Digging into the guts of the Iceberg catalog spec, Ananth Packkildurai highlights some of the issues that the overall "compatibility" guarantee might mask.
-
You can now use Iceberg with DuckDB from the comfort of your web browser—Carlo Piovesan and colleagues show how.
-
Shaurya Rawat has written up a good exploration of the different ways of getting data from Kafka to Iceberg.
-
Ryft are building a product around Iceberg, and with it some good blog posts. These include a discussion of the challenges of streaming with Iceberg, strategies for materialising CDC data into Iceberg, catalogs in Iceberg, and a look at the readiness of different engines for Iceberg v3.
-
Alex Merced and Andrew Madson are collecting responses for their survey of the Iceberg ecosystem. It’s open until 31st January for anyone using Iceberg to participate in.
-
Iceberg on GCP? Oscar Pulido looks at the three options available in the BigQuery ecosystem, whilst Emir Erez describes Trendyol’s GCP-based Iceberg data lakehouse.
-
Speaking of GCP, Talat Uyarer & Alex Stephen have published details of the public Iceberg datasets on GCP, starting with NYC Taxi. You do need to be set up with GCP to access it (I got excited at first, seeing the Iceberg on DuckDB in browser functionality that I mention above!).
Delta Lake and Hudi 🔗
-
I enjoyed this post from Prem Vishnoi, which examines the assumption that Delta is solely a "Spark thing" and looks at writing to Delta from Flink and Kafka directly. If this is something you’re wanting to do, you might also be interested in my previous article about writing to Delta from Flink SQL.
-
🔥 Hudi originated from Uber, and this in-depth blog post from Prashant Wason and colleagues at Uber describes in detail its use and deployment architecture, along with some impressive figures—6 trillion rows ingested per day, 350 PB stored, etc.
-
A couple of interesting write-ups of how Hudi is used in data platforms, from Zupee and Funding Circle.
RDBMS 🔗
-
Thomas Kejser takes a nice gnarly look at query execution specifics in one of the TPC-H benchmark queries.
-
🔥 Marc Brooker ponders what a database designed specifically for SSDs might look like.
-
A couple of Postgres internals posts, with Radim Marek looking at Postgres arrays and Tomas Vondra digging into the paradoxical finding that more memory in Postgres isn’t always better.
-
MySQL was bought by Oracle in 2009, and the pattern of commits to it has changed markedly in the last year according to Otto Kekäläinen, who argues that it’s no longer true open source, and that users should consider adopting MariaDB instead.
-
Details from Dhyanam Vaidya and colleagues at Uber describing how they implement load management on their MySQL-based in-house distributed databases Docstore and Schemaless.
General Data Stuff 🔗
-
SQLNet is a social media platform created by Vladyslav Len, in which all interactions are by SQL. Seriously. Try it out!
-
Orchestra’s Hugo Lu posits that Snowflake and Databricks are hitting their market ceiling.
-
Squirreling is a ~9 KB SQL engine with zero external dependencies for running SQL queries in the browser. Their blog post explains the background to it, and why tools like DuckDB-Wasm alone aren’t sufficient.
-
Details from Phillip LeBlanc at SpiceAI on their use of Apache DataFusion.
-
Jon Anderson explores FoundationDB (a distributed K/V database) in this blog post.
-
The team behind Responsive have pivoted from Kafka Streams to launch OpenData, "a collection of open source databases built on a common, object-native storage and infrastructure foundation."
-
A fun write-up from Tomás Senart at Axiom detailing an optimisation project on EventDB, their in-house database, eventually getting it to deliver 178 billion rows per second throughput.
AI 🔗
I warned you previously…this AI stuff is here to stay, and it’d be short-sighted to think otherwise. As I read and learn more about it, I’m going to share interesting links (the clue is in the blog post title) that I find—whilst trying to avoid the breathless hype and slop.
Well, gosh darnit. Didn’t this just blow up in the last month? Whilst Sonnet 4.5 was just trundling along giving ammo to the AI deniers, Opus 4.5 has come and blown things out of the water. If you’ve no idea what I’m talking about, have a read / listen to Casey Newton’s recent thoughts.
In an industry in which the term super-exciting has been devalued, what is happening now is, genuinely, SUPER-SUPER-EXCITING.
I’ll open with DHH’s tweet:
You can't let the slop and cringe deny you the wonder of AI. This is the most exciting thing we've made computers do since we connected them to the internet. If you spent 2025 being pessimistic or skeptical on AI, why not give the start of 2026 a try with optimism and curiosity?
— DHH (@dhh) January 3, 2026
Not a fan of DHH? How about Charity Majors: 2025 was for AI what 2010 was for cloud.
Stop and think about that. In the late 2000s cloud was this thing that of course the vendors were trying to sell (vendors gonna vend), and some people got, but most people sniffed at or ignored. Now 15 years later…who’s ignoring cloud?
Anyway, I’ll save my pontificating for another blog post. But there are a ton of interesting links to share with you about AI, so here they are. You’ll have to excuse the lack of narrative on each one; there are just too many this month :)
Looking forward & looking back 🔗
-
Sebastian Raschka: The State Of LLMs 2025.
-
🔥 Andrej Karpathy: 2025 LLM Year in Review.
-
Simon Willison: 2025: The year in LLMs.
-
Cedric Clyburn: The state of open source AI models in 2025.
-
🔥 Simon Willison: LLM predictions for 2026.
-
Jakob Nielsen: 18 Predictions for 2026
-
Brent Ozar: Database Development with AI in 2026
Strategy & Ideas 🔗
-
antirez: Don’t fall into the anti-AI hype (lobste.rs comments).
-
🔥 Oxide and Friends podcast: Engineering Rigor in the LLM Age.
-
Eric Broda: Ambient Agents
-
If AI makes things so easy, are we going to need moats? Definitely. Both Data (Vikram Sreekanti & Joseph E. Gonzalez) and Personal Taste (Cong Wang).
Using AI—Product 🔗
-
Amaresh Marripudi: Building Swiggy’s Conversational AI Analyst (InfoQ summary)
-
🔥 Yuqing Zhang, Congzhe Su, Susan Liu: How Etsy Uses LLMs to Improve Search Relevance
-
Chintan Turakhia: Engineering More Reliable Transportation with Machine Learning and AI at Uber (an old post, but still interesting)
-
Karthik Ramgopal and Daniel Hewlett: Lessons Learned from Building LinkedIn’s First Agent: Hiring Assistant
Using AI—Platforms 🔗
-
Jason Shang and Artem Nabirkin: Inside the feature store powering real-time AI in Dropbox Dash
-
Rohan Varshney: Lyft’s Feature Store: Architecture, Optimization, and Evolution
-
🔥 Paul Baranowski: Simplifying Large-Scale LLM Processing across Instacart with Maple
-
Karthik Ramgopal and Prince Valluri: Scaling Agents and MCP at LinkedIn
Using AI—Engineering 🔗
-
🔥 Bryan Cantrill: Using LLMs at Oxide
-
Paarth Chothani: Uber’s RAG‑Powered Slack Bot for On‑Call Support (2024)
-
Ethan Mollick: Claude Code and What Comes Next
-
Max Charas and Marc Bruggmann: Background Coding Agents at Spotify part 1 / part 2 / part 3
-
Andrew Swerdlow, Sepideh Setayeshfar, Kushan Mehta, and David Rahn: Doubling AI Code Acceptance at Roblox
-
Eric Chima and Leonardo Quixadá: Scaling Unit Test Coverage using AI Tools at the New York Times
-
Léo Valette: Using AI to Automate Code Migrations at BlaBlaCar
Philosophy & Society 🔗
-
Simon Willison: What are the ethics around using LLMs to port open source code?
-
🔥 Mariano-Florentino Cuéllar et al: Shaping AI’s Impact on Billions of Lives
-
Casey Newton: Debunking the AI food delivery hoax that fooled Reddit
-
Dan Birks: Evidence-based policing and the rise of AI
-
🔥 Rutger Bregman: Fighting for Humanity in the Age of the Machine (BBC Reith Lectures 2025, 4/4)
And finally… 🔗
Nothing to do with data, but stuff that I’ve found interesting or has made me smile.
Think 🔗
-
🔥 Brad Stulberg: 25 Rules for Living an Excellent Life in a Chaotic World
-
Rion Williams: Striking a Balance: Working Fully Remote for Nearly a Decade
-
Candost Dagdeviren: The Unbearable Joy of Sitting Alone in A Café
-
Rutger Bregman: BBC Reith Lectures 2025: Moral Revolution
Nerd 🔗
-
ZOOMQUILT 2: I don’t even know what this is, but it’s damn cool. And liable to give you motion sickness ;)
-
Some good lists of engineering blogs to follow from folk on Hacker News, and Alexey Milovidov at ClickHouse.
-
Nikita Prokopov: It’s hard to justify Tahoe icons (not really Mac-specific, just a useful study in UX patterns)
