June has been a busy month—113 links below for your enjoyment and delectation.
I’m going to share one extra link up here with you, though it’s not my fault if it wrecks your productivity! My friend Kris Jenkins has written this devilishly simple but addictive browser-based game: Escape the Moon.
AI 🔗
I warned you previously…this AI stuff is here to stay, and it’d be short-sighted to think otherwise. As I read and learn more about it, I’m going to share interesting links (the clue is in the blog post title) that I find—whilst trying to avoid the breathless hype and slop.
AI impact, big picture, and cultural impact on software engineering 🔗
-
I’ll stop featuring posts from Charity Majors when she stops writing such on-point and well-argued content. For now though, enjoy two more great posts:
-
🔥 Armin Ronacher - Communities of Not.
-
Gergely Orosz - Why is Meta destroying its engineering organization?
-
Dian Fay - what llms are doing to conference programs.
-
What do you do if you maintain a testing library and feel that anyone using AI coding agents should be punished? What about having that testing library use prompt injection to delete your users’ code? That’s what happened with the jqwik project: jqwik: intent of printMessageForCodingAgents() — visible to agents, invisible to humans.
AI in action 🔗
-
🔥 Datadog - How we migrated a live routing system using AI-assisted refactoring.
-
🔥 PostHog - Andrej Karpathy’s AI Autoresearch found a 3-year-old bug in PostHog’s query engine.
-
Mark Rittman - Making Agentic Analytics More Accurate using Anthropic’s Agentic Data Stack and the Wire Framework.
-
DoorDash - Building a unified consumer memory for personalization at scale.
-
DoorDash - Building DoorDash Assistant: An engineering overview.
-
State Farm - Using LLM as a Judge to Monitor AI Agents in Production.
Building with AI 🔗
-
🔥 Armin Ronacher - The Coming Loop.
-
Petrica Leuca uses Python and llama.cpp to explore the question: What’s an AI agent exactly?
-
John Kutay - Building an AI Database for Agentic GTM Operations.
-
Izzy Miller - How Hex built a lab to evaluate data agents.
-
Nicolas Fränkel - AI gateways: why and how.
-
Vicki Boykis - Running local models is good now.
-
Databricks have open-sourced Omnigent, "a meta-harness for building and running AI agents". The Multi-AI agents caught my eye in particular as something that looks interesting to try.
-
ktx is an open-source project that describes itself as "a self-improving context layer that teaches agents how to query your warehouse accurately". You can see it in action from a couple of articles published by Madison Schott and Erfan Hesami.
Kafka and Event Streaming 🔗
-
🔥 Mina Tafreshi - Kafka Rebalances: What’s Actually Happening Under the Hood.
-
🔥 Michał Matłoka has been busy this month, first creating a tool that converts Kafka ACLs to Confluent RBAC, and then writing about a simulator for understanding how Kafka works. You can try the Kafka Simulator for yourself in your web browser. Related to this, he’s also written an article addressing the question Are Kafka Replicas Synchronous or Asynchronous?
-
OpenData’s Jason Gustafson and Almog Gavra (both Confluent alumni) reckon there’s a better way to implement one of the two primary use cases for Kafka, which they call 'routing', and they’re releasing OpenData Log to do it.
-
A trio of posts from Jack Vanlightly looking at Kafka Share Groups and beyond:
-
On the subject of share groups in Kafka, Sage Pierce has an alternative proposal, in the form of Atleon.
-
Another visualiser for Kafka, this one from Sandon Jacobs, is specifically for KIP-932 Share Groups.
-
Trivago’s ZhongLi Shen details How We Cut Kafka Consumer Deployment Costs by 83%.
-
Ângelo Galvão shows how to Create a simple Mutual TLS (mTLS) authentication for Strimzi.
-
Sam Barker puts Kroxylicious through its paces, and shares a benchmark harness and results.
-
Elad Eldor - The Cheapest Kafka Consumer Is One That Doesn’t Read From Kafka.
-
Aswin A writes up their experiments with Multi-Cluster Kafka on Kubernetes with Strimzi.
Stream Processing 🔗
-
🔥 Apache Flink 2.3.0 has been released, including a Native S3 FileSystem that is detailed in this post.
-
Apache Flink Agents 0.3.0 has also been released.
-
Francisco Morillo takes a look at the performance of Spark 4.1 Real-time Mode (RTM) compared to Flink.
-
🔥 Excellent deep-dive from Gorgias' Matthieu Bonneviot about their experience Building Billing on Apache Flink and the impact of event time on the system implementation.
-
Zander Matheson - Introducing the dbt Adapter for Confluent Cloud Flink SQL.
-
Katya Gorshkova - Calling LLMs from Flink.
-
Salva Alcántara has a nice write-up about Multi-Way Joins in Flink’s DataStream API.
-
Jevin Maltais was a guest on Data Engineering Podcast and talked about many Kafka-related topics including his project TypeStream for declaring Kafka Streams pipelines as config.
-
Yaroslav Tkachenko has launched Streamling, describing it as "a performant and extensible data streaming runtime built with the RAD stack (Rust, Arrow, DataFusion)".
-
Gilles Philippart has published a set of Data Streaming Cheatsheets under the banner of StreamSheets, including this one for Flink.
-
🔥 A deep-dive from Junaid Effendi looking at How Feldera Works: A True Incremental View Maintenance Engine.
-
On the subject of IVM, Databricks published a paper about Enzyme: Incremental View Maintenance for Data Engineering.
Analytics 🔗
-
🔥 Kyle Cheung - DuckDB Internals: Why is DuckDB Fast? (Part 1).
-
🔥 Building ClickCannon - a tool for benchmarking ClickHouse.
-
It’s ten years since ClickHouse was created, and their blog has a series of interesting posts reflecting on both the tool and the community around it:
-
Alexey Milovidov - Ten years of ClickHouse in open source.
-
Tyler Hannan - Ten years of open source: a stained-glass view.
-
Al Brown - What a difference 10 years of open source makes.
-
Al Brown - The open ecosystem around ClickHouse.
Data Platforms, Architectures, and Modelling 🔗
-
🔥 Jack Vanlightly asks: Can We Agree on a Storage/Workload Architecture Taxonomy?
-
Animesh Kumar et al assert that Most enterprise AI-for-data agents failed in 2025 because they lacked context, and Data Products, built above the engine, are the fix.
-
A recording and transcript of a talk that OpenAI’s Bonnie Xu did at InfoQ discussing their use of AI Agents to Make Sense of Data at OpenAI.
-
Pavlina Mitsou and Jonathan Warburton write about The Context Layer Behind Spotify’s Data Assistant.
-
Chen Chang and colleagues at Anthropic discuss How Anthropic enables self-service data analytics with Claude.
-
🔥 Fresha continue their run of excellent blog posts, with this one from Daniel Wiszowaty: Everything Everywhere As Of Once: Rebuilding Postgres Inside Snowflake.
-
Yordan Ivanov - How I Made My Data Platform’s Failures Public and Earned My Stakeholders' Trust.
-
Goutham Budati - Why Technically Excellent Data Teams Still Fail.
-
Booking.com’s Tiago Ferreira writes about scaling a shared DWH across multiple teams without turning governance into a delivery bottleneck.
-
Amer Hesson and colleagues at Netflix discuss Managing Data Assets at Netflix Scale.
-
Uber’s Daniel Musgrave and colleagues write about Simplifying Data and Product Integrations with a Data Abstraction Layer.
-
Details of Guidewire’s Federated Query Platform for ML at Scale.
-
Patrick Lam - Airbnb evolved its data architecture.
-
Rohit Channe and Simran Mirchandani - How Lyft Governs and Scales Key Data Definitions.
-
René Luijk discusses the use of the C4 model in the context of data platforms.
Data Engineering and Pipelines 🔗
-
🔥 Joe Reis - The Turf Wars Are Over. Time to Cross-Train.
-
🔥 Ben Rogojan - In 2026 The Data Fundamentals Matter More Than Ever.
-
Madison Mae - 4 Analytic Engineering Fundamentals That Haven’t Changed.
-
A reminder of Roche’s Maxim of Data Transformation:
Data should be transformed as far upstream as possible, and as far downstream as necessary.
-
Bruno Masciarelli - How to Build a Simple, Bulletproof Data Pipeline.
-
A couple of good posts from Joshua Kim about building a pipeline from scratch with dbt: part 1 / part 2.
-
Charly Clairmont - Why dbt-state Matters More Than You Think.
-
Joachim Hodana - 5 dbt mistakes I see in every startup.
-
dbt Labs published their 2026 State of Analytics Engineering Report, with some interesting data around use of AI (ofc).
-
Details of dbt Core v2 from Joel Labes and Grace Goheen.
-
Vikas Rai describes the Data Profiling Framework at Halodoc.
-
Artem Golubin - Using local ClickHouse for data processing.
CDC 🔗
-
🔥 George Zefko - Building a CDC pipeline, part 3: From Kafka events to an analytical event log (previously: part 1, part 2).
-
Andreas Andreakis - Why DBLog Is Snapshot-Equivalent (see also the original DBLog paper and the more recent A Theoretical Study of DBLog: Certified Virtual Cuts - both also by Andreas).
-
DoorDash’s Vinay Chella and Akshat Goel spoke at InfoQ about Write-Ahead Intent Log: A Foundation for Efficient CDC at Scale.
Open Table Formats (OTF), Catalogs, Lakehouses etc. 🔗
-
🔥 Gunnar Morling has announced the 1.0 release of Hardwood, which is, in his words, a fast, lightweight Apache Parquet reader for the JVM. He’s published some impressive benchmark figures too, showing just how much of a performance improvement can be had when reading Parquet files by using multi-threading.
-
vortex-java is a project that builds on Hardwood, implementing the Vortex columnar format in pure Java.
-
If you want to see first-hand how open-source projects navigate issues that are not straightforward and have strong arguments on either side, this issue on the Iceberg project is fascinating: #14797 Implement Iceberg Kafka Connect with Delta Writer Support in DV Mode.
-
The Apache Hudi project has been busy, releasing 1.2.0 with support for Stateless Global Upserts for Flink, along with publishing details of how Hudi is used at Southwest Airlines and Penn Entertainment.
-
Sivabalan Narayanan discusses on the Hudi blog Why Metadata Has to Be Mutation-Friendly.
-
Iceberg Doctor is a tool from Sarthak Singh for diagnostics on your Iceberg metadata.
-
A three-part series (1, 2, 3) from Giannis Polyzos about Apache Fluss and lakehouse tiering, along with a separate introductory overview of the concepts.
-
A useful post from Alex Merced about partition evolution in Iceberg.
-
Two posts from Soumil Shah about shredding VARIANT fields in Iceberg.
RDBMS and General Data Stuff 🔗
-
🔥 Kelsey Hightower - Thoughts on Open Source (2024).
-
🔥 Datadog’s Shree Sampath writes about how they implement HA for Postgres on Kubernetes.
-
Warner Music Group’s Yask Srivastava describes why we shrank our TimescaleDB chunks from 30 days to 7.
-
Tom Pang - The only scalable delete in Postgres is DROP TABLE.
-
pg_durable brings durable execution to Postgres.
-
🔥 This is less a "data" blog post and more just a "cool tech blog post" :) Conor Gallagher from Zalando details how they do Client-Side Load Balancing at a Million Requests Per Second.
-
Eric Sun - The Join-Aware Materialized View Query Rewrite Gap.
-
A good write-up from dbt’s Tristan Handy about recent tech events and trends: Hunting for Tokens. Snowflake Summit. Agent Use Cases.
And finally… 🔗
Nothing to do with data, but stuff that I’ve found interesting or has made me smile.
Work 🔗
-
You’ll Get Re-Orged Again This Year. Here’s How to Be Ready.
-
🔥 A great write-up from Apurva Mehta about his company, Responsive, and what went wrong: Our first customers were the exception.
-
Simonida Jovanovic - How do you sell a forever-free product? PostHog’s answer is 10 emails and a pet hedgehog.
Life 🔗
-
🔥 Casey Neistat - yeah, 730 days no exceptions.
-
Murat Demirbas - 5 Lessons at 50.
Fun 🔗
-
TikTok is my guilty pleasure, and this account always brings a smile to my face for how completely and utterly weird-yet-engaging it is: Le Triangle.
-
I am definitely adding this to my to-do list: Writerdeck. Not sure if this one is Work or Fun. I’ll probably convince myself it’s the former whilst doing the latter ;)
Misc 🔗
-
🔥 I love this: using ML to read burnt papyrus scrolls from Vesuvius. You can find details of the full project here.
-
I started 'scrobbling' my music well over 20 years ago, so this one was a blast from the past for me: Last.fm is now independent.