Interesting links - August 2025

Published in Interesting Links at https://rmoff.net/2025/08/21/interesting-links-august-2025/

Not got time for all this? I’ve marked 🔥 for my top reads of the month :)

Tip
You can find previous editions of Interesting Links here.

Data Engineering

Data in Action

  • Building your own data ingestion framework may be a siren song for many, but Cloudflare operate at the kind of scale where it’s perhaps worth it. Read about Jetflow here.

  • Nubank have published a series of interesting blog posts about their use of stream processing, including with Kafka and Flink. There’s also a meetup recording (in Spanish) that looks like it has lots more details.

  • Details from UK bank Monzo on their Go-based fraud prevention platform.

  • 🔥 Excellent blog post from Anton Borisov at Fresha detailing why and how they adopted StarRocks after finding that Snowflake "wasn’t cost-effective — or fast enough — for chatty, near-real-time product and operational analytics."

  • Guidewire have published a couple of interesting blog posts looking at their data platform design, testing, and optimisation.

Apache Kafka

Open Table Formats & Catalogs

Let’s be honest, it’s mostly just Apache Iceberg…😅

Stream Processing

  • Apache Flink 2.1.0 has been released

  • Ming Hung Tsai wrote a three-part series showing how you could use Kafka Streams to implement a ticket reservation system (also discussed in the Reddit thread linked above).

  • 🔥 Sometimes the old ones are the best—and this article from Tyler Akidau nine years ago is still just as important to read today if you’re thinking about stream processing: Streaming 102: The world beyond batch—The what, where, when, and how of unbounded data processing.

  • LinkedIn’s Jiangjie Qin, a PMC member for both Apache Flink and Apache Kafka, spoke at QCon SF about Stream and Batch Processing Convergence in Flink

  • Should you use a hammer to tighten a screw? No. Should you try to express all your stream processing needs in SQL? Also no.

  • FLIP-541 is a proposal to make PyFlink more Pythonic, and looks to have wide support in the community.

  • Databricks announced the public preview of a real-time mode for Spark Structured Streaming. It will be donated to the Apache Spark project but is currently only available on Databricks.

RDBMS + CDC

General Data Stuff

  • 🔥🔥 Hot off the press is another banger from Jack Vanlightly, this time looking at A Conceptual Model for Storage Unification. If you’re interested in things like writing Kafka data to Iceberg, this is a vital foundation for understanding the design considerations and trade-offs.

  • How Klaviyo use Ray for their scalable data processing, training, and optimization

  • Prompted by a talk that Tesla gave about ingesting metrics into ClickHouse, Javier Santana at Tinybird set out to reproduce the feat using a 50-node ClickHouse cluster. In a sense these exercises are somewhat BSD and clickbait-y, but I do like the clear steps and detail that he showed in the blog post :).

  • 🔥 If anyone is going to need to build their own time-series database (TSDB), Datadog is going to be one of the top contenders. In this blog post they write about how they built it using Rust and the benefits they saw (60x ingest, 5x query). Also interesting is the history of their previous TSDB platforms.

  • FastLanes describes itself as a Next-Gen Big Data File Format, intended as a replacement for columnar formats such as the somewhat-ubiquitous Parquet. Beyond several conference papers, it’s unclear whether the format has seen any adoption in the wild yet.

And finally…

Nothing to do with data, but stuff that I’ve found interesting or has made me smile.

