rmoff's random ramblings

Interesting links - April 2025

Published Apr 22, 2025 by Robin Moffatt in Interesting Links at https://rmoff.net/2025/04/22/interesting-links-april-2025/

So. Many. Interesting. Links. Not got time for all this? I’ve marked 🔥 for my top reads of the month :)

Data Engineering

  • 🔥 Data Engineering: Now with 30% More Bullshit

  • 🔥 A good article from Andrew Jones on the concept of "shift left"

  • Data Model Smells

  • A love letter to the CSV format

  • The 2025 State of Analytics Engineering Report

  • Useful writeup from Anders Swanson on Iceberg, the Iceberg REST Catalog Specification, and more

  • Georg Heiler - Upskilling data engineers

Kafka

  • 🔥 Taking out the Trash: Garbage Collection of Object Storage at Massive Scale

  • KIP-1150: Diskless Topics - Apache Kafka - Apache Software Foundation

  • Using Data Contracts with the Rust Schema Registry Client

  • ktea - a Kafka TUI client

  • Behind Sending Millions of Messages Per Second: A Look Under the Hood of Kafka Producer

  • Benchmarking Kafka: Distributed Workers and Workload topology in OpenMessaging Benchmark

  • Queues for Kafka, my opinion (see also: no true scotsman)

CDC

  • 🔥 A Deep Dive Into Ingesting Debezium Events From Kafka With Flink SQL

  • A really good illustration of how CDC can enable low-latency use of data from transactional systems without impacting the OLTP workloads

  • Best Practices for Flink CDC YAML in Realtime Compute for Apache Flink

  • Using Debezium and Kafka Connect with Iceberg part I & part II

  • How Kleinanzeigen used Debezium and Apache Kafka for data migration

Stream Processing

  • 🔥 A good article on using Flink SQL’s MATCH_RECOGNIZE for Real Time Fraud Detection

  • A new paper discussing Snowflake Dynamic Tables

  • A Flink CDC Pipeline connector for Apache Iceberg has been added into the project ahead of Flink CDC release 3.4.0

  • SQLFlow: DuckDB for Streaming Data

  • A good writeup of performance optimisations made in Zomato’s Flink data streaming pipeline

  • Comparing Flink SQL and DataStream API

  • Build Kafka Streams Apps Faster with kstreamplify and Spring Boot

  • A proposal (SPIP) to add a Declarative Pipeline Framework to Apache Spark

  • Pedro Mazala writes about The case for a Custom Window in Flink

  • Flash: A Next-gen Vectorized Stream Processing Engine Compatible with Apache Flink

  • A talk from Liang Wu about LinkedIn’s internal Darwin tool for running Flink SQL in a notebook-like interface

AI

  • 🔥 Fundamental Challenges in Evaluating Text2SQL Solutions and Detecting Their Limitations

  • Meal planning with AI (not just AI, but Event-Driven, Multi-Agent AI Architectures 😁)

  • A Day in the Life of a ML Engineer at Trainline

  • Hands-on MCP Server Deep Dive: Connecting Flink SQL Gateway to the LLM Ecosystem

General Data Stuff

  • 🔥 It’s Time to Stop Building KV Databases

  • Hacking the Postgres wire protocol

  • Some interesting articles from LanceDB, including where they see The Future of Open Source Table Formats: Apache Iceberg and Lance, why LanceDB is a suitable table format for ML Workloads, and details of Lance File 2.1: Smaller and Simpler

  • Slides from a seminar given by Will Deakin using some excellent dataviz to tell us about the UK rail network and its usage

  • Cloudflare have been busy, acquiring stream processing startup Arroyo, launching managed Apache Iceberg tables, and optimising their tool for migrating data from other providers' object stores into their own

  • I recently discovered okbob/pspg which is a very nice pager for working with database CLIs such as psql

  • Details of v3 of LinkedIn’s Nuage tool, which they describe as a control plane for data systems

  • TigerBeetle recently published a technical overview of the internals of TigerBeetle

Data in Action

  • A couple of interesting blogs from Salesforce, covering handling a lot of search queries with sub-second latency and their use of Trino for ETL at Petabyte-Scale

  • Some interesting blogs from Discord (both recent and older), covering various facets of their infrastructure: storage, indexing, processing, and their use of dbt

  • I really enjoyed this article about how Zillow use knowledge graphs to help people find a house to buy

  • One of the departments within Amazon built a data lake platform called Nexus around Spark and Hudi (recording)

  • Klaviyo wrote about the evolution of their event analytics platform to include ClickHouse, having originally built it on Cassandra before adding Kafka and Flink (and optimising it further)

  • An account from Lyka of their migration from a data warehouse on BigQuery to a lakehouse using Iceberg on S3 with Athena, and a data warehouse on Snowflake

  • Details of how Adevinta moved from a Medallion-based lakehouse architecture to one built around data contracts and data mesh

And finally…

Nothing to do with data, but stuff that I’ve found interesting or has made me smile.

  • 🔥 Eject disk.

  • How to Bike Across the Country

  • How to Use Em Dashes (—), En Dashes (–), and Hyphens (-)


Tip
If you like this kind of link, you might like to read about How I Try To Keep Up With The Data Tech World (A List of Data Blogs)

Robin Moffatt

Robin Moffatt works on the DevRel team at Confluent. He likes writing about himself in the third person, eating good breakfasts, and drinking good beer.
