Sneaking it in just before the end of the month!
It’s a bumper set of links this month—I started with an original backlog of 125 links to get through. Some fell by the wayside, but plenty of others (78, to be precise) made the cut. With no further ado, let’s get cracking!
Not got time for all this? I’ve marked 🔥 for my top reads of the month :)
Data Engineering and Architecture 🔗
-
Some blog posts don’t need to be long. This, from Lawrence Kesteloot, is one of them. Use singular nouns for database table names.
-
DuckDB and DuckLake in action in this blog post from Daniel Wallace
-
Simon Späti has published an in-depth Data Modeling Guide for Real-Time Analytics with ClickHouse
-
🔥 It’s all kicked off again in the blogosphere, with Daniel Beach calling out the Medallion Architecture Farce, before pondering whether the architecture itself is Truth or Fiction? He also took a moment to raise concerns that data modeling may be dead. Ananth Packkildurai posted a calm and measured analysis of the architecture, and Joe Reis reminded us that the medallion architecture is NOT a Data Model. I’ll even chuck in my earlier thoughts on it, in which I draw clear parallels with Oracle’s reference architecture back in 2013… 😁
Meanwhile, Robert Anderson shared his scepticism of Data Vault.
-
DuckDB 1.4 was released, and with it the first LTS (Long Term Support) version. New features include support for writing Apache Iceberg files (see below), MERGE, and database encryption.
-
Max Gabrielsson discusses the use of spatial data in DuckDB and the improvements made to performance of spatial joins in 1.3.
-
Getting data from Kafka to Snowflake can be done in several different ways; Emma Amor covers two of them in this blog post.
AI 🔗
Note: Wait, what’s this? A new section this month, all about AI? Is Robin now drinking the hype-juice too? Don’t worry, this isn’t a rebranding. AI is important, and it’s here to stay. To the nay-sayers who scoff at the errors it makes and laugh at the idea that it can do our jobs…you are missing the point. Some of the attitudes I’ve encountered give me heavy vibes of Oracle DBAs 15 years ago who derided the idea of "The Cloud". That came to pass, completely upending how we build things, and so will AI. (We’ll ignore Blockchain for now…not every hype turns into reality 😉).
🔥 Sam Newman posted an excellent note on LinkedIn, which begins:
To those of you who are deeply pessimistic around the use of AI in software delivery, the old quote from John Maynard Keynes comes to mind:
"The market can remain irrational longer than you can remain solvent".
Go read the rest of the post (it’s not long). In addition, Scott Werner’s article 🔥 The Only Skill That Matters Now puts it even more clearly into focus, with a nice analogy about how "skating to the puck" is no longer a viable strategy (tl;dr the rate of change in AI means you have no idea where the puck will even be).
The impact of AI is going to be felt universally. Here are some interesting articles that I’ve come across this month about it in the sphere of data:
-
A paper titled Supporting Our AI Overlords: Redesigning Data Systems to be Agent-First discussing the ways in which LLMs want to retrieve data and how we might change how we model data to support that. Murat Demirbas has a nice analysis and commentary on the paper.
-
A summary of a presentation given by Xintong Song on the new Flink Agents project (a formal sub-project of Apache Flink itself)
-
MCP servers are a nice way to provide standard interoperability between LLMs and other computer systems. I’ve not tried these ones out specifically but the idea of being able to chat to Claude about Flink, Kafka, and Confluent Cloud certainly sounds like a cool idea :)
-
A good account from Pedro Nascimento of why what sounds like a simple enough idea ("build an AI-powered data analyst") is a lot more complex than you may think.
Iceberg (and other OTF/Data Lake stuff) 🔗
I mean, there may be some Delta Lake, Hudi, and DuckLake in here…but in my corner of the internet it’s Iceberg all the way…
-
Apache Iceberg 1.10 was released
-
DuckDB’s support for writing to Iceberg is covered by Angel Conde and Dwicky Feri.
-
🔥 Anton Borisov has been busy this month, with a deep-dive on Iceberg’s Merge-on-Read (MoR) support in StarRocks, along with a practical guide to the evolution and key differences between Iceberg, Delta, Paimon, and DuckLake.
-
DuckLake 0.3 was released, including support for copying to and from Iceberg.
-
There are several interesting Iceberg performance articles, including a grab-bag of 11 tools and tips, discussion from Vincent Daniel of the MERGE statement, and a write-up from Ancestry on their optimisation of a 100-billion row Iceberg table.
-
As well as Iceberg optimisation (above), Vincent Daniel also has a good blog post about WAP (Write, Audit, Publish) in Iceberg. This is something I spent some time looking at in the past too, and still think is a good data engineering pattern to draw on.
-
Jeffrey Jonathan Jennings has a nice hands-on example of deploying a platform of Kafka, Flink, and Iceberg with Confluent Cloud.
-
🔥 A thoughtful write-up from Ananth Packkildurai (he of Data Engineering Weekly fame) addressing the challenge of Fast Changing Dimensions (FCD) in Iceberg, and examining some alternative architectures and technologies.
-
🔥 A well-written and spicy take from WarpStream’s Richie Artoul on getting data from Kafka into Iceberg and where he sees some of the proposals from the community as flawed.
-
On the admin side of things, Nimtable’s compaction runtime for Iceberg has been released (open source/Apache 2.0), and Apache Amoro claim 10x performance improvements from their table maintenance feature.
-
The Iceberg community is thriving and growing, and you can find talks from several recent meetups in Europe online here.
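Since WAP (Write, Audit, Publish) came up in the list above, here is a minimal sketch of the shape of the pattern. It is plain Python and deliberately not the Iceberg API: the `Table` class, its `branches` dict, and the checks in `audit` are all hypothetical stand-ins that I’ve made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    # Hypothetical stand-in for a table with branches; NOT the Iceberg API.
    branches: dict = field(default_factory=lambda: {"main": []})

    def write(self, branch, rows):
        # Write: new rows land on a staging branch based on main,
        # so readers of main never see them before the audit.
        base = self.branches.get(branch, self.branches["main"])
        self.branches[branch] = list(base) + list(rows)

    def publish(self, branch):
        # Publish: point main at the audited branch in one step
        # and drop the staging branch.
        self.branches["main"] = self.branches.pop(branch)


def audit(rows):
    # Audit: example checks only (no missing ids, no negative amounts).
    return all(r.get("id") is not None and r.get("amount", 0) >= 0 for r in rows)


def write_audit_publish(table, rows, branch="staging"):
    table.write(branch, rows)
    if audit(table.branches[branch]):
        table.publish(branch)
        return True
    del table.branches[branch]  # failed audit: discard, main is untouched
    return False
```

The point of the pattern is that a failed audit leaves `main` exactly as it was, so downstream readers only ever see data that has passed the checks.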
Kafka and Event Streaming 🔗
-
Apache Kafka 4.1 was released. Sandon Jacobs has a useful summary of the key new features.
-
Shruti Mantri has a good article about Queues for Apache Kafka.
-
🔥 A few nice bits from Stanislav Kozlovski this month, with a deep-dive on How Kafka Works, an infographic of the Top 5 largest Kafka deployments, and advice on sizing a Kafka cluster that’s using tiered storage.
-
Aleksei Aleinikov writes about why and how you should do authorisation with roles in Kafka.
-
🔥 A nice deep-dive two-part series from Edoardo Vacchi looking at extending the Kafka broker.
-
PagerDuty had an outage last month, at the heart of which was Kafka and an error in the implementation of an application using it. Read the gory details here.
-
Klaviyo migrated from RabbitMQ to Kafka - read about the why, the how, and the impact in these two blog posts.
-
"Does Kafka Guarantee Message Delivery?" is a question that prompted this blog post and some discussion over on r/apachekafka.
-
Jaehyeon Kim built a custom SMT (Single Message Transform) for Kafka Connect to add observability into a pipeline.
Stream Processing 🔗
-
Flink CDC 3.5 has been released, which includes new pipeline connectors for Apache Fluss and Postgres
-
Lorenzo Nicora and Felix John published a two-part blog series on application lifecycle when using Amazon Managed Service for Apache Flink (MSF).
-
Jack Vanlightly published one of his fantastic deep-dives, this time looking in great detail at Apache Fluss.
-
🔥 Anton Borisov writes about Fluss, comparing it in use with Flink 2.1’s DeltaJoin feature to standalone solutions from RisingWave and Feldera
-
A nice little GitHub repo from Gordon Murray in which he shows how to get up and running with Fluss, Paimon, and Flink.
-
An example application from Sebastien Viale showing how to ensure Kafka Streams uses explicit resource naming (added in Kafka 4.1 with KIP-1111)
-
Details from a talk by Yuan Mei about Flink state management.
General Data Stuff 🔗
-
🔥 A thorough history and analysis of digital analytics from Timo Dechau, covering Google Analytics, GA4, and more, up to the current state of affairs.
-
Andrew Lamb writes about performance improvements when working with Parquet files.
-
Cloudflare announced their Data Platform, including the Arroyo-acquisition driven Cloudflare Pipelines, R2 Data Catalog, and a distributed SQL engine called R2 SQL.
-
InfoQ published their AI, ML and Data Engineering Trends Report.
-
DriftDB is an "experimental append-only database with built-in time travel".
-
Avinash Sajjanshetty muses on replacing a cache service with a database.
-
Postgres 18 adds the ability to get current and previous row values in the RETURNING clause, which sounds neat.
-
pgstream is an Apache 2.0 licensed project from Xata that offers Postgres replication to targets including Kafka.
-
🔥 Good analysis from RedMonk’s Stephen O’Grady on the open-source data storage space, including Postgres vs MySQL, MongoDB, and DocumentDB.
Data in Action 🔗
-
Details of how Netflix built a Write-Ahead-Log (WAL) to make their data platform more resilient.
-
Cursor migrated from AWS Aurora Limitless to PlanetScale.
-
Wix saved 50% of their data platform costs by moving their Spark workloads from EMR to EMR on EKS—they cover why and how in this two part series.
-
dbt in action at BlaBlaCar.
-
🔥 Netflix originally built their Muse analytics platform on Druid with offline Spark. To meet performance requirements they moved to pre-aggregating data with their homegrown Hollow tool, while still using Druid, plus Spark and Iceberg offline.
-
Some details of the data architecture at Decathlon, and how they use Polars.
-
How Stripe use Apache Flink for real-time analytics.
-
Details of how Uber replicate between their two HDFS-based datalakes using HiveSync.
-
🔥 A nice under-the-covers look at Fresha’s data lakehouse architecture from Paritosh Anand.
-
Chick-fil-A’s Caleb Lampert describes their Data Asset Certification Framework (and its relationship to soup…)
-
Airbnb built their own K/V store called Mussel—read about the original V1 and the re-architected V2.
-
Metagenomi write about how they use LanceDB on S3.
-
A write-up of a talk given by Xiaotong Jiang from Databricks on how they approach OLTP database performance and optimisation in a multi-tenant architecture.
-
Details of how Bazaarvoice migrated from RDS MySQL to AWS Aurora.
-
🔥 A deep-dive from Stephanie Wang (previously a founding engineer at MotherDuck) on how MotherDuck is built.
-
Practical tips from Sadeq Dousti at Trade Republic on the implementation of the outbox pattern, based on their experiences.
-
How Grab use Pinot (and Kafka and Flink) for low-latency analytics.
Newsletters 🔗
If you can’t wait for this monthly round-up of links, you might like the following:
-
TLDR (there’s a general tech edition, plus additional specialist ones for data, AI, etc)
And finally… 🔗
Nothing to do with data, but stuff that I’ve found interesting or has made me smile.
-
🔥 Kirill Bobrov - How the Community Turned Into a SaaS Commercial
-
TIL: Line Scan Cameras
-
🔥 How I, a non-developer, read the tutorial you, a developer, wrote for me, a beginner
-
Calling boss a dickhead was not a sackable offence, tribunal rules
-
Dayvi Schuster - Dev Culture Is Dying The Curious Developer Is Gone
Tip: If you like these kinds of links you might like to read about How I Try To Keep Up With The Data Tech World (A List of Data Blogs)