Here’s a bunch of interesting links and articles about data that I’ve come across recently.
-
Martin Kleppmann’s seminal talk from 2015, Turning the database inside out, came up on my feed recently, and is still such an important work.
-
Going back even further, check out the original SQL paper, from 1974.
-
Not only do I love the clever title, but The End of the Bronze Age: Rethinking the Medallion Architecture is also a really good explanation of how "shift left" applies in the data world. If you prefer video, there’s one of those too.
-
An interesting interview from a while back with Materialize’s CTO, Nikhil Benesch.
-
A useful look at the practicalities of Data Products, Data Contracts, and Change Data Capture.
-
I recently came across an interesting project from the European Union called Big Data Test Infrastructure. Despite the slightly old-fashioned name, they’re doing some cool stuff with data and public services, such as this one looking at tree health in a town in Germany.
-
Datadog have their own proprietary event storage system called Husky. They’ve previously shared details of the ingestion process, and have recently posted about how data compaction is handled at scale.
-
Two Apache projects were recently announced as graduating to top-level projects, including Apache StreamPark.
-
Excellent analysis from Jack Vanlightly looking at Why Snowflake wants streaming (specifically, Redpanda, about whom acquisition rumours are swirling).
-
What better way to learn the low-level details of Kafka than writing your own broker?
-
Confluent recently launched a VSCode plugin which now supports Kafka clusters too (not just Confluent Cloud).
-
A fantastic deep-dive blog on Kafka transactions.
-
A nicely explained and illustrated guide to windowing in Kafka Streams.
-
Mickael Maison has now been writing the Kafka Monthly Digest for an impressive seven years!
-
Hyprstream is built on Apache Arrow Flight and DuckDB for "real-time data ingestion, windowed aggregation, caching, and serving". Read the associated paper here.
-
Uber run 2,300 MySQL clusters; this post has details of how they do it.