Welcome to the 10th edition of Interesting Links. I’ve got over a hundred links for you this month—all of them, IMHO, interesting :)
I’ll start off by shamelessly plugging the articles that I published this month:
-
It turns out, I’ve been thinking about Agents and MCP all wrong. It was a bit of a 💡 for me, and if you’re trying to grok wtf agents are, give it a read and let me know if it helps you.
-
(AI) Smells on Medium - a proper ranty post, inspired by compiling this very newsletter. There is so much shit being published these days; a lot of it on Medium. The enshittification of the internet is real, and it makes me sad :(
-
The post got discussed on lobste.rs and one comment made me smile:
Ironically this article unwittingly showcases the source material for slop style grammar
Have a read and see if you agree ;)
-
Details of how I helped build the demo for the day 2 keynote at Current in New Orleans last month using Kafka, Flink, and LLMs.
RFC 🔗
For you youngsters: Request For Comments
Links - too many, too few, or just right? 🔗
This newsletter has grown, both in audience and number of links. Back in February there were fewer than two dozen links. This month, there are nearly 150 😲.
I’d love to hear from you whether you would like to see fewer links, or if the current amount is about right. Also let me know if there are areas of which you want to see more (or less).
Use the comment section at the end of this article to give feedback, or find me on Twitter, LinkedIn, etc.
Email? 🔗
Would you prefer to read this as an email? If there’s the appetite I’m happy to set something up, either just x-posting to Substack, or perhaps something self-hosted like ListMonk.
Again - leave a comment below, or find me online :)
Call for Papers - Current 2026 🔗
The Calls for Papers for both Current London and Current Bengaluru are open, closing on December 22nd.
Tip: If you need a hand with writing your abstract, you might find these articles that I’ve written helpful:
And if you’re a speaker, check out the excellent article titled "The Silent Crowd" from Sam Harris, which includes this important point (amongst others):
Anyway, On with the Links 👇 🔗
Not got time for all this? I’ve marked 🔥 for my top reads of the month :)
Kafka and Event Streaming 🔗
-
A trio of articles from Zeinab Dashti about exactly-once processing in Kafka, using Transactions, the Outbox Pattern, and the Listen-to-Yourself Pattern.
-
A nice post from Stefan Kecskes exploring what KRaft is and what the migration process looks like.
-
An interesting blog post from last year in which Andy Pearce gives an introduction to BlazingMQ.
-
Taking a break from his hOT TaKEs about the imminent demise of Kafka, Stanislav Kozlovski shared on Reddit a useful comparison of the floor price of Kafka on hosted services. Predictably, there was plenty of disagreement in the ensuing thread about the criterion used in the comparison—and this discussion in itself was useful as food for thought when making these comparisons for yourself.
-
🔥 I love this series of posts from Geoff Williams in which he hooks up a home-built weather station to stream readings to Kafka via Home Assistant.
-
Luthra Sahil writes about building an email sync platform based around Kafka.
-
A couple of interesting posts from Ian Duncan, covering Event Design for Streaming Systems and details of JSON Schema. (If you struggle reading the grey-on-black text, opening it in Safari and using the Reader view works well.)
Stream Processing 🔗
-
🔥 Regularly featured in these roundups in recent months is Yennick Trevels, who this month brings us a very cool Interactive Kafka Streams Architecture Simulation.
-
Expedia’s Vishal Sharma writes about tips for consuming from multiple topics with Kafka Streams.
-
MongoDB are previewing built-in stream processing capabilities that can output to Iceberg tables directly.
-
An excellent explainer from Sean Falconer on Real-Time Anomaly Detection with Apache Flink
-
Cool stuff from the team at Grab, showing how they use Flink to test the compliance of Kafka messages with defined data contracts.
-
🔥 Anton Borisov has a deep-dive exploration of Evolution of Streaming Computation Models, looking back to Chandy–Lamport, and then at Flink, Fluss, and beyond.
-
This post from Katya Gorshkova about using Flink to filter data from Kafka is a great example of one of my favourite ways to learn something - understand the landscape and context, and write simple examples to build up understanding bit by bit. The second part just dropped too, looking at deploying Flink with Kubernetes.
-
Three talks about Flink at this year’s excellent P99 Conf:
-
Konstantinos Karavitis has an interesting post about the design pattern known as "hexagonal architecture" and its application when building microservices with Flink.
-
A summary from Jennifer Ebe about Delta Joins, introduced into Flink in FLIP-486 (with support for it added in Fluss 0.8), and covered in more detail in this blog post from Alibaba.
-
Rion Williams is seeking feedback from the Flink community on his idea of a DemultiplexingSink.
-
I’ve written previously about the challenge of streaming data from multiple sources to multiple Iceberg tables with Flink; Apple’s Swapna Marru writes about how this is made easier with the new Flink Dynamic Iceberg Sink in Iceberg 1.10.
-
Ayden Adair built a very cool Flink-powered candy bowl for Halloween, and made a video to show it off.
Data Platforms, Architectures, and Modelling 🔗
-
Details from StarRocks of how, and why, Cisco WebEx migrated from Pinot to StarRocks for their real-time analytics.
-
A summary from Eran Stiller of Stripe’s recent InfoQ talk about migrating petabytes of data between systems with zero downtime.
-
Rene Schallner has a cool blog post about using TigerBeetle to build a high-performance ticketing system.
-
Luis Medina and Ajit Koti from Netflix bring us the second part of a series about the real-time distributed graph platform that they’re building, looking in this post at the use of Cassandra for the storage layer.
-
🔥 Cool stuff from the team at Uber, detailing how they instrument I/O calls from Spark and other technologies and use Kafka and Pinot to store and monitor performance of a data lake that is petabytes in size.
-
A good recap from Rahul Joshi at Capital One of the history of data lakes through to data lakehouses, along with an analysis of OTFs and their perhaps-inevitable convergence. Pair it with this from Spotify’s Kirill Bobrov for a second pass at the concepts.
-
🔥 Fascinating detail from Agoda’s Save Pavanavimutti and Art Nanakorn about how the booking system works.
-
Nice overview from Fernando Franco of ways to scale the data storage layer in system design.
-
Gunnar Morling explains why it’s more nuanced than simply using Postgres to replace Kafka.
-
It’s not often you see someone talking about migrating off DuckDB, but that’s exactly what Bauplan have done—Jacopo Tagliabue explains why Bauplan moved from DuckDB to DataFusion.
-
If you like this kind of thing, Matt Turck’s Machine Learning, AI & Data Landscape for 2025 has been published along with commentary.
-
Ananth Packkildurai has a good article about the data systems needed to support modern "go to market" operations, including marketing and sales, while taking into account various privacy laws.
Data Engineering, Pipelines, and CDC 🔗
-
Is it too meta, in a list of interesting links, to link to a list of links? Regardless, this list from Faruk Tufekci of resources for analytics engineers is really useful.
-
Detailed articles from Jan Zedníček looking at how to use dbt to handle and implement SCD2.
-
Cutting over from historical to realtime data in a pipeline can be a tricky problem—Nicoleta Lazar from Fresha has a nice article detailing how they do it with Snowflake, Flink, and Airflow.
-
I’m a fan of the Write-Audit-Publish (WAP) pattern, and enjoyed this article from Soumil Shah showing how to do WAP with Amazon S3 Tables.
-
🔥 An excellent roundup of the Q&A that Simon Späti, Mehdi Ouazza, Julien Hurault, and Ben Rogojan did based on common questions from Reddit’s r/dataengineering. Lots of useful content here.
-
LinkedIn’s Gaojie Liu and Jialin Liu explain how the ingestion pipeline for Venice works.
-
Hans-Peter Grahsl has published a nice Docker Compose to spin up Flink, Fluss, and LanceDB. The README has a good overview of how and why you might want to experiment with the particular stack.
-
The TinyETL project from Alex Nemeth looks interesting for simple full-load data movement between standard formats and RDBMS.
-
🔥 Excellent detailed post from Andrew Zhang and Sanketh Balakrishna at Datadog explaining how they use Kafka Connect and Debezium to replicate from Postgres to Elasticsearch and Iceberg, including handling schemas and more.
-
🔥 If the above article from Datadog whetted your appetite for what you can build with Kafka Connect, you’ll love this practical and clear introduction to Kafka Connect and its components and concepts from Stefan Kecskes.
Open Table Formats (OTF), Catalogs, Lakehouses etc. 🔗
Lots of links in this category this month! I’ve split out some of the technology-specific stuff into their own sections below.
-
Alexandre Bergere has an analysis of Databricks' acquisition of Neon and what it means for their platform.
-
OTF bake-off blogs might be old hat, but this one comparing Iceberg/Delta/Hudi from Gabriel Popa adds a new spin: data sovereignty requirements for Switzerland.
-
Discussion in the Apache community about optimisation proposals for Parquet and how to move them forward within the project structures.
-
🔥 Umesh Dangat and Toby Cole from Yelp with details of their adoption of Apache Paimon for their "streamhouse".
Apache Fluss 🔗
It’s not a table format…it’s not a lakehouse…it’s…Fluss ¯\_(ツ)_/¯
(If you’ve got a better category or mental-model for me to bucket it into, let me know in the comments below!)
-
Giannis Polyzos and Jark Wu have details of the Fluss 0.8 release
-
A useful overview from Alibaba of Fluss and Paimon: what they do, where they overlap, and how to decide if they fit your requirements.
-
Real-world details of Fluss in action in this blog from Xinyu Zhang and Lilei Wang at TaoBao, looking in detail at why they adopted it and how they use it.
-
🔥 The Future Data Systems Seminar Series from the Carnegie Mellon University Database Research Group is a very cool free resource with weekly deep-dives from experts in the industry. The lecture on 8th December is from the original creator of Fluss, Jark Wu. All the talks are recorded and available online afterwards.
Apache Iceberg 🔗
-
Two years ago one would have thought that Hell was freezing over (Iceberg! freezing! geddit?!) with Databricks announcing full support for Iceberg v3, but following the acquisition of Tabular and the wide adoption of Iceberg in the industry, it seems a pretty sensible move.
-
Two good Iceberg blogs from Jack Vanlightly this month.
-
The first covers the nitty-gritty of a very real "rubber meets the road" integration question; if you’re writing from Kafka to Iceberg, how do you square the circle of choosing what kind of access to optimise the data layout for?
(Jark Wu has an interesting reply to it here.)
-
In his second post he looks at the idea of OTree Spatial Indexes for Iceberg.
-
Cloudera’s Ayush Saxena looks at how Apache Hive now supports Iceberg’s implementation of the Variant data type.
-
A good writeup from Manishankar Ravuri examining how deletions are handled in different versions of the Iceberg spec, and in Delta Lake.
-
The surest sign of solid adoption of a technology is when tools supporting its use spring up in parallel to the project itself. Jack Leitch from Whoop has written about Glacierbase, which they built to manage schemas across their PBs of Iceberg tables.
-
🔥 Marc Selwan has written a cool front-end for the Iceberg Catalog, iceberg.rest.
-
🔥 This is amazing from Snowflake (a company not originally renowned for open source work): they’ve open-sourced pg_lake which enables you to access Iceberg tables from within Postgres. Open source and open standards FTW!
Apache Hudi 🔗
-
Apache Hudi 1.1 has been released; Shiyan Xu has the details.
-
Interesting summary of a talk from ad-tech company FreeWheel about their use of Hudi, including its replacement of a Spark/Presto/ClickHouse-based Lambda architecture.
-
Onehouse are definitely not pivoting away from Hudi, as they launch their faster-Spark runtime, Quanton, with claimed performance improvements for Spark/Iceberg (wait, what?) workloads too.
-
A nice two-part deep-dive series about indexes in Hudi from Shiyan Xu.
RDBMS 🔗
-
Vlad Bokov has a good hands-on explainer of how Fresha reduced their 4TB Postgres database on RDS by 75%.
-
A handy reminder from Elizabeth Christensen about how to find your way around Postgres' catalog and system tables.
-
Useful notes from Murat Demirbas discussing a 2022 paper on Disaggregated Database Management Systems. Murat also presented on the topic at InfoQ, and Steef-Jan Wiggers has written up a summary of the talk.
-
🔥 You do test your Postgres queries, right? Right? This article from Radim Marek discusses his fork of RegreSQL and how it can be used to test a bunch of stuff including performance and query plan regressions.
-
The SQL standard working group has formally accepted a change to the SQL standard to add support for GROUP BY ALL.
-
DuckDB 1.4.2 LTS adds support for insert, update, and delete statements on Iceberg tables from DuckDB.
-
🔥 I loved this talk from Andy Pavlo at P99 Conf about practical research done into optimising query execution using techniques including, but not limited to, LLMs.
-
In a similar vein, Alper Çiftçi from Trendyol writes about their experience using AI to optimise Postgres queries.
-
Sometimes you don’t need an LLM to figure out how to optimise your queries; Nimisha Vernekar has a good primer on basic query optimisation techniques that it would behove anyone working with SQL to understand.
-
🔥 Snowflake might have enabled access from Postgres to Iceberg with their open-source pg_lake (see above), but DuckDB now has pg_duckdb, with which you can use DuckDB from within Postgres and thus access not only Iceberg but the multitude of other sources and types of data that DuckDB can read.
What a time to be alive in the data world 😁
General Data Stuff 🔗
-
🔥 I’ve loved using notebooks like Zeppelin and Jupyter for years, and several times people have recommended I look at Marimo - this article from Parul Pandey makes such a compelling case for it I really do need to take a look. Maybe a project for the quiet holiday period :)
-
RayforceDB is a new open-source (MIT license) columnar database less than 1MB in size, and written in C.
-
Lots of people were impacted by the Cloudflare outage earlier this month; what I found interesting in the excellent postmortem that they published was that it wasn’t DNS, and it was related, in part, to unexpected/unforeseen results from a SQL query.
-
There were some excellent talks at P99 Conf this year, including:
-
Almog Gavra - 8x Better Than Protobuf: Rethinking Serialization for Data Pipelines
-
Tanel Poder - xCapture v3: Efficient, Always-On Thread Level Observability with eBPF
-
Duarte Nunes - Timeseries Storage at Ludicrous Speed
-
Rachel Stephens & Adrian Cockcroft - Performance Insights Beyond P99: Tales from the Long Tail
-
Andy Pavlo - ChatGPT Ain’t Got $%@& On Me! (discussed above, but so good it’s worth mentioning twice!)
-
The Thoughtworks TechRadar was updated recently. I find it interesting as a snapshot of how technologies ebb and flow in their use and adoption (in the ecosystem within which Thoughtworks operates). I published a very short summary of the relevant data entries from the radar.
-
Did you know that /dev/null is an ACID compliant database? ;)
AI 🔗
I warned you previously…this AI stuff is here to stay, and it’d be short-sighted to think otherwise. As I read and learn more about it, I’m going to share interesting links (the clue is in the blog post title) that I find—whilst trying to avoid the breathless hype and slop.
-
🔥 David Aronchick writes about the brutal reality gap between the vendors and the folk on the ground just trying to get shit done, in this insightful article: Two KubeCons, One Conference: While Everyone Demos AI Agents, Engineers Are Fighting With Syslogs
-
Thanks to LLMs, Stack Overflow is dead, long live Reddit! (I am paraphrasing; you can read the full paper with all the nuances and context here: The consequences of generative AI for online knowledge communities)
AI in the Enterprise 🔗
-
Details of AI/ML platforms from Lyft, Pinterest, and Netflix.
-
In an article that does what it says on the tin, Andrey Chubin goes through some of the critical mistakes that companies make when integrating AI/ML into their processes.
-
Like the previous article in the list here, this one from Dr. Janna Lipenkova is a look at the actual implementation of AI in a company. To quote the title of the article: It Doesn’t Need to Be a Chatbot.
Data for AI 🔗
-
🔥 A detailed article from Lak Lakshmanan considering how data engineering patterns might change as we store and prepare data for use by Agents in the future.
-
Dmitry Pavlov describes how ClickHouse made their internal data warehouse "AI-first".
Coding with AI 🔗
-
Gergely Orosz (a.k.a. The Pragmatic Engineer) chatted with Martin Fowler about how AI will change software engineering.
-
AI really does open Pandora’s box of new vectors for attack, which, coupled with the race for development and adoption, makes for a potent mix. Simon Willison has a good explanation of some of the prompt injection attacks.
-
Proving the above point, Anthropic have written up details of an attack that used Agentic AI.
Agents and MCP 🔗
-
🔥 I love this practical example from Thomas Ptacek that demonstrates what an Agent actually is: You Should Write An Agent.
-
This is one of the articles that made me realise that I’ve been thinking about Agents and MCP all wrong.
-
🔥 12 Factor Agents is a very practical guide from Dex Horthy (modelled on the idea of 12 Factor Apps) looking at all the practical considerations you should have when designing and productionising LLM applications.
-
A useful list of Agentic Patterns from Philipp Schmid.
-
Viktor Gamov recently did an excellent talk looking at How MCP Bridges LLMs and Data Streams
-
What’s the difference between prompt engineering and context engineering? And what is context engineering and why does it matter so much? The team at Anthropic have written a good blog post looking at these questions and more.
And finally… 🔗
Nothing to do with data, but stuff that I’ve found interesting or has made me smile.
Think 🔗
-
🔥 The more senior you become in your career, the more you need to be aware of the sometimes-unintended power of your words. Kelly Vaughn explains why in this article: You might be carrying an invisible gun
-
PostHog’s Charles Cook argues that Collaboration sucks
-
Practical advice from Andrea Canton: Always Be Ready to Leave (Even If You Never Do)
Rant 🔗
-
🔥 Daniel Fichtinger on why your project fucking sucks (why FOSS needs gardeners, not influencers).
-
Robin Wilding - 12 Things I’ve Heard Boomers Say That I Agree With 100%
Watch 🔗
-
Shashank Tomar - Strange Attractors
Nerd 🔗
-
Very cool look at ADS-B from Randy Au - Counting the planes overhead.
-
What happens when you run rm -rf /? Kyle Kelley found out.
-
🔥 I love the persistence that Raghu Saxena shows in this article looking at how WiFi is secured on British Airways flights.
Just a reminder - leave a comment 👇 🔗
-
Is the current number of links in this newsletter about right, or would you like to see fewer?
-
Are there any areas of which you want to see more (or less)?
-
Would you prefer to read this as an email?
Leave a comment below, or find me online :)
