Interesting links

🔥 Not got time for all this? I’ve marked my top reads of the month :)
📧 Want to receive this monthly round-up as an email? Subscribe to my Substack where I cross-post the same content
🔗 Medium posts often skulk behind a gate, so I’ve hyperlinked to the Freedium version. You’ll see [Medium ↗] next to each link if you prefer the original.

Data Engineering

🔥 Amongst all the background noise of ETL vs ELT vs ZeroETL, this primer from Ben Rogojan (a.k.a. "The Seattle Data Guy") is a great reminder of the actual 'T' that needs doing to our data, wherever it is that we do it.
Ask ten engineers in the data space the difference between job titles and you’ll get a dozen opinions. In a sense this doesn’t matter, but this post is useful for laying out a common understanding of the breakdown of the tasks involved in the different 'buckets'.
Details of a migration from a Hive-based data warehouse to one on Apache Iceberg.
A good look from Kayak at real-world design choices and their implications, including the impact of not using schemas in Kafka, and why they ultimately moved away from Snowflake to Trino instead.
Boring catalog is a new file-based catalog for Apache Iceberg, with a write-up here.
More good content from Joe Reis' upcoming book on data modeling:
Good writeup of some of the more gnarly challenges of modelling and aggregating user interactions at Vimeo in ClickHouse.
🔥 Sean Falconer writes about How to Clean and Enrich Data Before It Lands in Snowflake.

🔥 Gunnar Morling had an interesting thought experiment that the internet liked: What If We Could Rebuild Kafka From Scratch? (HN, lobst.rs).
LinkedIn talked about their new Northguard project, slated to be a replacement for Kafka.
Platformatic released a new Node.js Kafka client and an accompanying blog explaining why.
A useful look at how brokers and controllers communicate when you deploy Kafka using KRaft.
A good introductory article from Vu Trinh: If you’re learning Kafka, this article is for you.
Details of how S2 handles time.

🔥 I published a second blog about Flink SQL, covering joins and changelogs. If you missed last month’s, check it out at It’s Time We Talked About Time: Exploring Watermarks (And More) In Flink SQL.
A couple of interesting posts from Yaroslav Tkachenko about challenges in data streaming.
Details of Flink’s new Materialized Tables feature in this blog from Alibaba.
GetInData wrote up an article looking at what to consider when choosing between Kafka Streams and Apache Flink.
🔥 An excellent talk given by Adi Polak at QCon: Streaming All the Things — Patterns of Effective Data Stream Processing.

Agoda published a good blog post with details of how they’re using GPT to optimise stored procedures, with some impressive results.
🔥 Thoughtful article from Joe Reis on What Does AI Do to The Craft of Software and Data Engineering?

Synadia and CNCF had a dust-up over the ownership of the NATS trademark, followed by a public reconciliation. El Reg has a summary of it here.
RedMonk are my favourite analyst firm, keeping things very real and grounded in their writing. They have a nice article about OSS here.
I played around with Confluent’s new Tableflow feature, looking at how it could be useful for initial data exploration in a project.
🔥 A really good podcast/video in which Kris Jenkins talks to Andrew Lamb about Apache DataFusion.
Details from SpiralDB of plans for Vortex v1.0, a new columnar file format.
Details of how Uber migrated from Mesos to Kubernetes.
Fluss, from the team at Alibaba, has been proposed as an Apache Incubator project.

🔥 Details of how Netflix use Kafka, Flink, and Druid in their Ads platform.
Lyft’s Real-Time Spatial Temporal Forecasting, built with tools including Kafka, ClickHouse, and Beam/Flink.
Nice detail of how Zomato use Flink SQL in their real-time ads platform (they also wrote about their Flink adoption previously).
Three deep-dive blogs from Kakao on their adoption of Flink CDC to get data into Iceberg, and their experience operating Iceberg.
Picnic describe their real-time analytics platform built with Kafka for ingest, and ClickHouse for processing (via refreshable materialized views) and serving.
Building a BI platform is one thing, but the journey doesn’t stop there: you also need to see how people are using it, which is what Halodoc built a system to analyse.

Nothing to do with data, but stuff that I’ve found interesting or that has made me smile.

If you like this kind of link you might like to read about How I Try To Keep Up With The Data Tech World (A List of Data Blogs)