Interesting links - April 2025

22 Apr 2025 by · Interesting Links at https://rmoff.net/2025/04/22/interesting-links-april-2025/

🔥 Not got time for all this? I’ve marked my top reads of the month :)
📧 Want to receive this monthly round-up as an email? Subscribe to my Substack where I cross-post the same content
🔗 Medium posts often skulk behind a gate, so I’ve hyperlinked to the Freedium version. You’ll see [Medium ↗] next to each link if you prefer the original.

Data Engineering 🔗

🔥 Data Engineering: Now with 30% More Bullshit
🔥 A good article from Andrew Jones on the concept of "shift left"
Data Model Smells
A love letter to the CSV format
The 2025 State of Analytics Engineering Report
Useful writeup from Anders Swanson on Iceberg, the Iceberg REST Catalog Specification, and more
Georg Heiler - Upskilling data engineers

Kafka 🔗

CDC 🔗

🔥 A Deep Dive Into Ingesting Debezium Events From Kafka With Flink SQL
A really good illustration of how CDC can enable low-latency use of data from transactional systems without impacting the OLTP workloads
Best Practices for Flink CDC YAML in Realtime Compute for Apache Flink
Using Debezium and Kafka Connect with Iceberg part I & part II
How Kleinanzeigen used Debezium and Apache Kafka for data migration

Stream Processing 🔗

🔥 A good article on using Flink SQL’s MATCH_RECOGNIZE for Real Time Fraud Detection
A new paper discussing Snowflake Dynamic Tables
A Flink CDC Pipeline connector for Apache Iceberg has been added into the project ahead of Flink CDC release 3.4.0
SQLFlow: DuckDB for Streaming Data
A good writeup of performance optimisations made in Zomato’s Flink data streaming pipeline
Comparing Flink SQL and DataStream API
Build Kafka Streams Apps Faster with kstreamplify and Spring Boot
A proposal (SPIP) to add a Declarative Pipeline Framework to Apache Spark
Pedro Mazala writes about The case for a Custom Window in Flink
Flash: A Next-gen Vectorized Stream Processing Engine Compatible with Apache Flink
A talk from Liang Wu about LinkedIn’s internal Darwin tool for running Flink SQL in a notebook-like interface

AI 🔗

🔥 Fundamental Challenges in Evaluating Text2SQL Solutions and Detecting Their Limitations
Meal planning with AI (not just AI, but Event-Driven, Multi-Agent AI Architectures 😁)
A Day in the Life of a ML Engineer at Trainline
Hands-on MCP Server Deep Dive: Connecting Flink SQL Gateway to the LLM Ecosystem

General Data Stuff 🔗

🔥 It’s Time to Stop Building KV Databases
Hacking the Postgres wire protocol
Some interesting articles from LanceDB, including where they see The Future of Open Source Table Formats: Apache Iceberg and Lance, why LanceDB is a suitable table format for ML Workloads, and details of Lance File 2.1: Smaller and Simpler
Slides from a seminar given by Will Deakin using some excellent dataviz to tell us about the UK rail network and its usage
Cloudflare have been busy, acquiring stream processing startup Arroyo, launching managed Apache Iceberg tables, and optimising their tool for migrating data from other providers' object stores into their own
I recently discovered okbob/pspg which is a very nice pager for working with database CLIs such as psql
Details of v3 of LinkedIn’s Nuage tool, which they describe as a control plane for data systems
TigerBeetle recently published a technical overview of the internals of TigerBeetle

Data in Action 🔗

A couple of interesting blogs from Salesforce, covering handling a lot of search queries with sub-second latency and their use of Trino for ETL at Petabyte-Scale
Some interesting blogs from Discord (both recent and older), covering various facets of their infrastructure: storage, indexing, processing, and their use of dbt
I really enjoyed this article about how Zillow use knowledge graphs to help people find a house to buy
One of the departments within Amazon built a data lake platform called Nexus around Spark and Hudi (recording)
Klaviyo wrote about the evolution of their event analytics platform to include ClickHouse, having originally built it on Cassandra before adding Kafka and Flink (and optimising it further)
An account from Lyka of their migration from a data warehouse on BigQuery to a lakehouse using Iceberg on S3 with Athena, and a data warehouse on Snowflake
Details of how Adevinta moved from a Medallion-based lakehouse architecture to one built around data contracts and data mesh

And finally… 🔗

Nothing to do with data, but stuff that I’ve found interesting or has made me smile.

If you like these kinds of links you might like to read about How I Try To Keep Up With The Data Tech World (A List of Data Blogs)
