Checkpoint Chronicle - November 2023

Published in Checkpoint Chronicle at https://rmoff.net/2023/11/14/checkpoint-chronicle-november-2023/

Note
This post originally appeared on the Decodable blog.

Welcome to the Checkpoint Chronicle, a monthly roundup of interesting stuff in the data and streaming space. Your hosts and esteemed curators of said content are Gunnar Morling and Robin Moffatt - feel free to send our way any choice nuggets that you think we should feature in future editions.

  • Apache Iceberg has some notable big adopters, such as Netflix (where it was created), and Apple. In this fascinating talk from QCon New York 2023, Stephen Wu from Apple talks about using Iceberg as a streaming source into Apache Flink, including the reasons why they included Iceberg in the pipeline and not just Apache Kafka throughout.

  • Several long-time developers contributing to the Kafka Streams project within Apache Kafka have launched a startup called Responsive, and with it an absolute barn-stormer of a blog post: A Size for Every Stream: The Expert’s Guide to Sizing Kafka Streams.

  • In this excellent talk from QCon London 2023, just published on the QCon website (alongside the previous blog on the same subject), Matt Boyle and Andrea Medda from Cloudflare detail their journey to 1 trillion messages in Kafka and the issues and learnings they had along the way. Some of the points included dealing with message compatibility whilst retaining loose coupling as they moved away from a monolithic implementation, internal tooling and libraries to improve developer velocity, and accurate monitoring and instrumentation.

  • From the Decodable stables this month we have a useful set of blogs: 

  • A sign of the increasing maturity of the stream processing space is that we are moving beyond solely “Hey I implemented Hello World with Flink!” blog posts (although there’s nothing wrong with these too, and I always applaud learning in public) and onto the gnarlier topics that come with running stream processing applications for real. This post from Yaroslav Tkachenko discusses a zero-downtime technique for deploying new versions of Flink applications based on the blue-green deployment pattern.

  • Arroyo are one of the many new companies in the increasingly-crowded streaming space, and earlier this year published this excellent blog that explains streaming SQL clearly along with a discussion of dealing with updates, comparing two different approaches that it calls Dataflow Semantics and Update Semantics.

  • Another new kid on the block in streaming is Epsio, who published this useful explainer of their implementation of a streaming SQL engine.

  • A year after their first post, the Netflix team returned with an update on the Data Mesh platform, detailing their move from individual Flink operators to adoption of Flink SQL. It’s interesting to see how Flink is used, and also to get detail on exactly how companies at this scale implement streaming technologies.
    It’s worth noting that Netflix’s Data Mesh != Zhamak Dehghani’s Data Mesh - it’s just a naming collision, and one which Netflix discusses here.

  • The Open Table Format (OTF) squabbles continue, with this blog from Iceberg co-creator and Tabular co-founder Ryan Blue discussing flaws in Apache Hudi’s Atomicity, Consistency, and Isolation guarantees, which one of Hudi’s co-creators (and Onehouse founder), Vinoth Chandar, disagrees with in this post. Databricks’ Ali Ghodsi even threw his tuppence worth in too.
    It seems there are at least two angles to this argument:

    • Arguments over the technical correctness of claimed features (ACID, etc)

    • Strategic positioning: both Databricks and Onehouse see the OTF future as being multiple formats with some form of interoperability between them (UniForm and OneTable, respectively). The impression that I’ve got is that Tabular see Iceberg as the format around which others will and should converge.

  • A common driver for moving from batch to stream processing is the need to get fresher information in front of users of an application. That’s what happened at Vinted, where they migrated their batch-based load of Elasticsearch to one powered by Flink.

  • Thanh Tung Dao has compiled two very useful lists

  • Replication slots not advancing in certain circumstances used to be a notorious source of headaches for users of change data capture with Postgres. The engineering team at Zalando (who are heavy users of Debezium) took a stab at addressing this using keep-alive messages. They discuss how they patched the Postgres JDBC driver in this in-depth blog post. No more unexpected WAL growth!
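For anyone who hasn’t been bitten by the replication slot problem in the last item, here is a toy model of why it happens. This is only a sketch under simplifying assumptions, not Zalando’s patch and not the Postgres JDBC driver’s API: `ReplicationSlot`, `retained_wal`, and the `"captured"` / `"other"` record kinds are all made up for illustration. The idea is that the server may only recycle WAL up to the position the consumer has confirmed, and a consumer that only confirms positions of change events it receives never moves forward while other, uncaptured tables keep writing.

```python
from dataclasses import dataclass


@dataclass
class ReplicationSlot:
    # Hypothetical stand-in for a slot: WAL up to this position may be recycled.
    confirmed_flush_lsn: int = 0


def retained_wal(records, ack_keepalives):
    """Return how many WAL records the server must retain after replaying `records`.

    `records` is a list of record kinds, one per LSN (starting at 1):
      "captured" - a change to a captured table, delivered to the consumer
      "other"    - WAL the consumer never sees as a change event
    For "other" records the consumer only learns the server position through
    a keep-alive message (assumed here to arrive once per record).
    """
    slot = ReplicationSlot()
    for lsn, kind in enumerate(records, start=1):
        if kind == "captured":
            # The consumer processed the event and confirms its position.
            slot.confirmed_flush_lsn = lsn
        elif ack_keepalives:
            # Nothing is pending on the consumer side, so it is safe to
            # confirm the position reported by the keep-alive.
            slot.confirmed_flush_lsn = lsn
    server_lsn = len(records)
    return server_lsn - slot.confirmed_flush_lsn


# One captured change followed by a long burst of unrelated writes.
workload = ["captured"] + ["other"] * 1000

print(retained_wal(workload, ack_keepalives=False))  # slot stuck at LSN 1
print(retained_wal(workload, ack_keepalives=True))   # slot keeps up
```

In the first run the slot stays pinned at the last captured change, so the retained WAL grows with every unrelated write; in the second, confirming keep-alive positions lets the slot follow the server.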

Paper of the Month

Who doesn’t love to read a good paper? In this spirit, we’re going to reference one research or industry paper (either classic, or just hot off the press) in the field of data processing and streaming each month, starting with one of our all-time favourites:

📄 One SQL to Rule Them All: An Efficient and Syntactically Idiomatic Approach to Management of Streams and Tables (arXiv:1905.12133)

In this paper from 2019, Edmon Begoli et al. discuss how the “pervasive use of time-varying relations, robust event-time semantics support, and materialization control can substantially improve the ease-of-use of streaming SQL”. Definitely an inspiring read!
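If “time-varying relation” sounds abstract, a rough way to picture it is a relation whose contents depend on the time at which you look at it: the changelog is the stream view, and a snapshot at a given time is the table view. The sketch below is our own illustration rather than anything taken from the paper; the `Change` type, the `snapshot` function, and the sample rows are hypothetical.

```python
from typing import NamedTuple


class Change(NamedTuple):
    time: int   # when the change took effect
    op: str     # "+" for insert, "-" for delete
    row: tuple


def snapshot(changelog, at):
    """Replay the changelog up to and including time `at` to get the table view."""
    rows = []
    # sorted() is stable, so changes sharing a timestamp keep their original order.
    for change in sorted(changelog, key=lambda c: c.time):
        if change.time > at:
            break
        if change.op == "+":
            rows.append(change.row)
        else:
            rows.remove(change.row)
    return rows


# Stream view: an update to alice's row is a delete followed by an insert.
changelog = [
    Change(1, "+", ("alice", 10)),
    Change(2, "+", ("bob", 5)),
    Change(3, "-", ("alice", 10)),
    Change(3, "+", ("alice", 12)),
]

print(snapshot(changelog, at=2))  # [('alice', 10), ('bob', 5)]
print(snapshot(changelog, at=3))  # [('bob', 5), ('alice', 12)]
```

The same data answers both “what changed?” and “what does the table look like at time t?”, which is the stream and table duality the paper builds on.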

Events

That’s all for this month! We hope you’ve enjoyed the newsletter, and are all-ears for any feedback or suggestions you’ve got.

Gunnar (LinkedIn / X / Mastodon / Email)

Robin (LinkedIn / X / Mastodon / Email)

