Interesting links - April 2025
So. Many. Interesting. Links. Not got time for all this? I’ve marked 🔥 for my top reads of the month :)
Confluent Cloud for Apache Flink lets you run Flink workloads on a serverless platform within Confluent Cloud. After poking around the Confluent Cloud API for configuring connectors, I wanted to take a look at the same for Flink.
Using the API is useful particularly if you want to script a deployment, or automate a bulk operation that might be tiresome to do otherwise. It’s also handy if you just prefer living in the CLI :)
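For a flavour of what that looks like, here’s a rough sketch of submitting a Flink SQL statement over the REST API. The region, IDs, and credentials are all placeholders, and the endpoint shape is as I understand the docs; verify against the current Confluent Cloud API reference before relying on it.

```bash
# Sketch only: region, IDs, and API key are placeholders
curl -s -u "$FLINK_API_KEY:$FLINK_API_SECRET" \
     -H 'Content-Type: application/json' \
     -X POST \
     "https://flink.us-west-2.aws.confluent.cloud/sql/v1/organizations/$ORG_ID/environments/$ENV_ID/statements" \
     -d '{
           "name": "my-statement",
           "spec": {
             "statement": "SHOW TABLES;",
             "compute_pool_id": "lfcp-xxxxxx",
             "principal": "sa-xxxxxx"
           }
         }'
```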
The problem with publishing February’s interesting links at the beginning of the month and only now getting around to publishing March’s at the end is that I have nearly two months' worth of links to share 😅 So, without further ado, let’s crack on.
I wanted to post a Carousel post on LinkedIn, but had to wade through a million pages of crap in Google from companies trying to sell shit. Here’s how to do it simply.
tl;dr: Upload a PDF document in which each slide of the carousel is one page.
In this blog post I’m going to explore how as a data engineer in the field today I might go about putting together a rudimentary data pipeline. I’ll take some operational data, and wrangle it into a form that makes it easily pliable for analytics work.
After a somewhat fevered and nightmarish period during which people walked around declaring that "Schema on Read" was the future, that "Data is the new oil", and "Look at the size of my big data", IT history is somewhat doubling back on itself towards a more sensible approach to things.
As they say:
What’s old is new
This is good news for me, because I am old and what I knew then is 'new' now ;)
DuckDB added a very cool UI last week and I’ve been using it as my primary interface to DuckDB since.
One thing that bothered me was that the SQL I was writing in the notebooks wasn’t exportable. Since the DuckDB UI uses DuckDB itself in the background for storing notebooks, getting the SQL out is easy enough.
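As a rough sketch of the idea (the path and table name here are my assumptions; poke around with SHOW ALL TABLES on your own install first):

```sql
-- Assumed location and schema; check your own install before relying on this
ATTACH '~/.duckdb/extension_data/ui/ui.db' AS ui (READ_ONLY);

-- See what's actually in there
SHOW ALL TABLES;

-- Hypothetical table name: pull out the stored notebook content
SELECT * FROM ui.main.notebook_versions LIMIT 5;
```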
I wrote a couple of weeks ago about using DuckDB and Rill Data to explore a new data source that I’m working with. I wanted to understand the data’s structure and distribution of values, as well as how different entities related. This week DuckDB 1.2.1 was released and that little 0.0.1 version boost brought with it the DuckDB UI.
Here I’ll go through the same process as I did before, and see how much of what I was doing can be done in DuckDB alone now.
Much as I love kcat (🤫 it’ll always be kafkacat to me…), this morning I nearly fell out with it 👇
😖 I thought I was going stir crazy, after listing topics on a broker and seeing topics from a different broker.
😵 WTF 😵
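If you want to reproduce the head-scratching, the metadata listing itself is just the first command below. One thing that’s easy to forget is that kcat can also pick up settings from a config file, so a bootstrap.servers lurking in there is a prime suspect for this kind of surprise (the path shown is, as far as I recall, kcat’s default; adjust for your setup):

```bash
# List brokers and topics for the cluster behind a given bootstrap server
kcat -L -b localhost:9092

# Check for a lurking config file that might override things
cat ~/.config/kcat.conf
```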
Some would say that the perfect blog article takes the reader on a journey in which the development process looks like this:
The UK Government publishes a lot of its data as open feeds. One that I keep coming back to is the Environment Agency’s flood-monitoring API, which gives access to an estate of sensors providing data such as river levels and rainfall.
The data is well-structured and provided across three primary API endpoints. In this blog article I’m going to show you how I use Flink SQL to explore and wrangle these into the kind of form from which I am then going to build a streaming pipeline using them.
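If you want to poke at the raw feeds yourself, the endpoints are plain HTTP and need no auth. Something like this works (the query parameters here are the documented ones; tweak to taste):

```bash
# Monitoring stations (here filtered to rainfall sensors)
curl -s "https://environment.data.gov.uk/flood-monitoring/id/stations?parameter=rainfall&_limit=5"

# Measures: what each station actually records
curl -s "https://environment.data.gov.uk/flood-monitoring/id/measures?_limit=5"

# Readings: the latest observed values
curl -s "https://environment.data.gov.uk/flood-monitoring/data/readings?latest&_limit=5"
```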
There was a useful question on the Apache Flink Slack recently about joining data in Flink SQL:
How can I join two streams of data by id in Flink, to get a combined view of the latest data?
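One way to approach this (a minimal sketch; table names, topics, and broker address are all hypothetical) is to model each stream as an upsert table keyed on the id. A regular join between two upsert tables then continuously reflects the latest row per key:

```sql
CREATE TABLE customers (
    id   STRING,
    name STRING,
    PRIMARY KEY (id) NOT ENFORCED
) WITH (
    'connector'                    = 'upsert-kafka',
    'topic'                        = 'customers',
    'properties.bootstrap.servers' = 'localhost:9092',
    'key.format'                   = 'raw',
    'value.format'                 = 'json'
);

CREATE TABLE orders (
    id    STRING,
    total DECIMAL(10,2),
    PRIMARY KEY (id) NOT ENFORCED
) WITH (
    'connector'                    = 'upsert-kafka',
    'topic'                        = 'orders',
    'properties.bootstrap.servers' = 'localhost:9092',
    'key.format'                   = 'raw',
    'value.format'                 = 'json'
);

-- The join output is itself a changelog: as either side updates,
-- the combined view is updated for that key
SELECT c.id, c.name, o.total
FROM customers c
JOIN orders o ON c.id = o.id;
```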
Let’s imagine we’ve got a source of data with a nested array of multiple values. The data is from an IoT device: each device has multiple sensors, and each sensor provides a reading.
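To make that concrete, here’s a hypothetical schema and the CROSS JOIN UNNEST pattern that flattens it to one row per sensor reading (datagen is used purely as a stand-in source):

```sql
CREATE TABLE device_readings (
    device_id STRING,
    readings  ARRAY<ROW<sensor_id STRING, reading DOUBLE>>
) WITH (
    'connector' = 'datagen'   -- stand-in source, just for illustration
);

-- One output row per element of the readings array
SELECT d.device_id,
       r.sensor_id,
       r.reading
FROM device_readings d
CROSS JOIN UNNEST(d.readings) AS r (sensor_id, reading);
```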
The UK Environment Agency publishes a feed of data relating to rainfall and river levels. As a prelude to building a streaming pipeline with this data, I wanted to understand the model of it first.
I was exploring some new data, joining across multiple tables, and doing a simple SELECT * as I’d not yet worked out which columns I actually wanted.
The issue was that the same field name existed in more than one table, which meant that in the query results it wasn’t clear which field came from which table.
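The boring-but-reliable fix is to alias the columns explicitly so each one carries its table of origin (table and column names here are made up for illustration):

```sql
-- Both tables have id and name; alias them so the output is unambiguous
SELECT a.id   AS account_id,
       a.name AS account_name,
       b.id   AS billing_id,
       b.name AS billing_name
FROM accounts a
JOIN billing  b ON a.id = b.id;
```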
Here’s a bunch of interesting links and articles about data that I’ve come across recently.
I’m a HUGE fan of Docs as Code in general, and specifically of tools like Vale that lint your prose for adherence to style rules.
One thing that had been bugging me though was how to selectively disable Vale for particular sections of a document. Usually linting issues should be addressed at the root: either fix the prose, or update the style rule. Either it’s a rule, or it’s not, right?
Sometimes though I’ve found a need to make a particular exception to a rule, or simply needed to skip linting for a particular file. I was struggling with how to do this in Asciidoc. Despite the documentation showing how to, I could never get it to work reliably. Now I’ve taken some time to dig into it, I think I’ve finally understood :)
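For reference, the comment-based directives described in the Vale docs look like this in AsciiDoc; the exact placement rules (blank lines, block boundaries) seem to be where things get fiddly, and behaviour may vary by Vale version:

```asciidoc
// vale off

None of this paragraph is linted.

// vale on

// vale Vale.Spelling = NO

Only the Vale.Spelling rule is disabled for this bit.

// vale Vale.Spelling = YES
```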
At Current 24 a few of us will be going for an early run (or walk) on Tuesday morning. Everyone is very welcome!
I do my best to try and keep, if not abreast of, then at least aware of what’s going on in the world of data. That includes RDBMS, event streaming, stream processing, open source data projects, data engineering, object storage, and more. If you’re interested in the same, then you might find this blog useful, because I’m sharing my sources :)
Let’s not bury the lede: it was DNS. However, unlike the meme ("It’s not DNS, it’s never DNS. It was DNS"), I didn’t even have an inkling that DNS might be the problem.
I’m writing a new blog about streaming Apache Kafka data to Apache Iceberg and wanted to provision a local Kafka cluster to pull data from remotely. I got this working nicely just last year using ngrok to expose the broker to the interwebz, so figured I’d use this again. Simple, right?
Nope.
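For context, the moving parts look roughly like this (the forwarding host and port are made-up examples; ngrok assigns its own each time):

```bash
# Open a public TCP tunnel to the local broker's listener
ngrok tcp 9092

# ngrok prints a forwarding address, e.g. tcp://0.tcp.ngrok.io:12345 (made up here).
# The broker needs to advertise that address so remote clients connect back
# through the tunnel instead of trying localhost:
#
#   advertised.listeners=PLAINTEXT://0.tcp.ngrok.io:12345
```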