So. Many. Interesting. Links. Not got time for all this? I’ve marked 🔥 for my top reads of the month :)
Data Engineering 🔗
-
🔥 A good article from Andrew Jones on the concept of "shift left"
-
Useful writeup from Anders Swanson on Iceberg, the Iceberg REST Catalog Specification, and more
Kafka 🔗
-
🔥 Taking out the Trash: Garbage Collection of Object Storage at Massive Scale
-
KIP-1150: Diskless Topics - Apache Kafka - Apache Software Foundation
-
ktea - a Kafka TUI client
-
Behind Sending Millions of Messages Per Second: A Look Under the Hood of Kafka Producer
-
Benchmarking Kafka: Distributed Workers and Workload topology in OpenMessaging Benchmark
-
Queues for Kafka, my opinion (see also: no true scotsman)
CDC 🔗
-
🔥 A Deep Dive Into Ingesting Debezium Events From Kafka With Flink SQL
-
A really good illustration of how CDC can enable low-latency use of data from transactional systems without impacting the OLTP workloads
-
Best Practices for Flink CDC YAML in Realtime Compute for Apache Flink
-
Using Debezium and Kafka Connect with Iceberg part I & part II
-
How Kleinanzeigen used Debezium and Apache Kafka for data migration
Stream Processing 🔗
-
🔥 A good article on using Flink SQL’s MATCH_RECOGNIZE for Real Time Fraud Detection
-
A new paper discussing Snowflake Dynamic Tables
-
A Flink CDC Pipeline connector for Apache Iceberg has been added into the project ahead of Flink CDC release 3.4.0
-
A good writeup of performance optimisations made in Zomato’s Flink data streaming pipeline
-
Build Kafka Streams Apps Faster with kstreamplify and Spring Boot
-
A proposal (SPIP) to add a Declarative Pipeline Framework to Apache Spark
-
Pedro Mazala writes about The case for a Custom Window in Flink
-
Flash: A Next-gen Vectorized Stream Processing Engine Compatible with Apache Flink
-
A talk from Liang Wu about LinkedIn’s internal Darwin tool for running Flink SQL in a notebook-like interface
AI 🔗
-
🔥 Fundamental Challenges in Evaluating Text2SQL Solutions and Detecting Their Limitations
-
Meal planning with AI (not just AI, but Event-Driven, Multi-Agent AI Architectures 😁)
-
Hands-on MCP Server Deep Dive: Connecting Flink SQL Gateway to the LLM Ecosystem
General Data Stuff 🔗
-
Some interesting articles from LanceDB, including where they see The Future of Open Source Table Formats: Apache Iceberg and Lance, why LanceDB is a suitable table format for ML Workloads, and details of Lance File 2.1: Smaller and Simpler
-
Slides from a seminar given by Will Deakin using some excellent dataviz to tell us about the UK rail network and its usage
-
Cloudflare have been busy, acquiring stream processing startup Arroyo, launching managed Apache Iceberg tables, and optimising their tool for migrating data from other providers' object stores into their own
-
I recently discovered okbob/pspg which is a very nice pager for working with database CLIs such as psql
-
Details of v3 of LinkedIn’s Nuage tool, which they describe as a control plane for data systems
-
TigerBeetle recently published a technical overview of the internals of TigerBeetle
Data in Action 🔗
-
A couple of interesting blogs from Salesforce, covering handling a lot of search queries with sub-second latency and their use of Trino for ETL at Petabyte-Scale
-
Some interesting blogs from Discord (both recent and older), covering various facets of their infrastructure: storage, indexing, processing, and their use of dbt
-
I really enjoyed this article about how Zillow use knowledge graphs to help people find a house to buy
-
One of the departments within Amazon built a data lake platform called Nexus around Spark and Hudi (recording)
-
Klaviyo wrote about the evolution of their event analytics platform to include ClickHouse, having originally built it on Cassandra before adding Kafka and Flink (and optimising it further)
-
An account from Lyka of their migration from a data warehouse on BigQuery to a lakehouse using Iceberg on S3 with Athena, and a data warehouse on Snowflake
-
Details of how Adevinta moved from a Medallion-based lakehouse architecture to one built around data contracts and data mesh.
And finally… 🔗
Nothing to do with data, but stuff that I’ve found interesting or has made me smile.
Tip: If you like these kinds of links you might like to read about How I Try To Keep Up With The Data Tech World (A List of Data Blogs)