Current 2024 - 5k Fun Run (or Walk)
At Current 24 a few of us will be going for an early run (or walk) on Tuesday morning. Everyone is very welcome!
I do my best to keep, if not abreast of, then at least aware of what’s going on in the world of data. That includes RDBMS, event streaming, stream processing, open source data projects, data engineering, object storage, and more. If you’re interested in the same, then you might find this blog useful, because I’m sharing my sources :)
Let’s not bury the lede: it was DNS. However, unlike the meme ("It’s not DNS, it’s never DNS. It was DNS"), I didn’t even have an inkling that DNS might be the problem.
I’m writing a new blog about streaming Apache Kafka data to Apache Iceberg and wanted to provision a local Kafka cluster to pull data from remotely. I got this working nicely just last year using ngrok to expose the broker to the interwebz, so figured I’d use this again. Simple, right?
Nope.
After a break from using AWS I had reason to reacquaint myself with it today, and did so via the CLI. The AWS CLI is pretty intuitive and has a good help text system, but one thing that kept frustrating me was that after closing the help text, the screen cleared, so I couldn’t copy the syntax out to use in my command!
The same thing happened when I ran a command that returned output - the screen cleared.
Here’s how to fix either, or both, of these.
At this year’s Kafka Summit I’m planning to continue the tradition of going for a run (or walk) with anyone who’d like to join in. This started back at Kafka Summit San Francisco in 2019 over the Golden Gate Bridge and has continued since then. Whilst London’s Docklands might not offer quite the same experience it’ll be fun nonetheless.
This year Kafka Summit London includes a dedicated track for talks about Apache Flink. This reflects the continued rise in interest in, and use of, Apache Flink in the streaming community, as well as the focus that Confluent (the hosts of Kafka Summit) has on it.
I’m looking forward to being back at Kafka Summit. I will be speaking on Tuesday afternoon, room hosting on Wednesday morning, and hanging out at the Decodable booth in between too.
Here’s a list of all the Flink talks, including the talk, time, and speaker. You can find more details, and the full Kafka Summit agenda, here.
At Decodable we migrated our docs platform onto Antora. I wrote previously about my escapades in getting cross-repository authentication working using Private Access Tokens (PAT). These are fine for just a single user, but they’re tied to that user, which isn’t a good practice for deployment in this case.
In this article I’ll show how to use GitHub Apps and Installation Access Tokens (IAT) instead, and go into some detail on how we’ve deployed Antora. Our GitHub repositories are private which makes it extra-gnarly.
A friend messaged me late last night with the scary news that Google had emailed him about a ton of spammy subdomains on his own domain.
"Any idea how this could have happened?" he asked.
Why should the Java folk have all the fun?!
My friend and colleague Gunnar Morling launched a fun challenge this week: how fast can you aggregate and summarise a billion rows of data? Cunningly named The One Billion Row Challenge (1BRC for short), it’s aimed at Java coders to look at new features in the language and optimisation techniques.
Not being a Java coder myself, and seeing how the challenge has already unofficially spread to other communities including Rust and Python, I thought I’d join in the fun using what I know best: SQL.
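To give a feel for the shape of the problem, here is a minimal Python sketch of the kind of aggregation involved. It is not the SQL approach from the post, nor anyone's actual challenge entry. It assumes input lines of the form `station;temperature` and a min/mean/max summary per station, and the `summarise` function and the sample readings are made up purely for illustration.

```python
def summarise(lines):
    """Return {station: (min, mean, max)} for 'station;temperature' lines.

    Assumption: each line holds a station name and one numeric reading,
    separated by a semicolon. Blank lines are skipped.
    """
    stats = {}  # station -> [min, max, running total, count]
    for line in lines:
        line = line.strip()
        if not line:
            continue
        station, _, reading = line.partition(";")
        value = float(reading)
        if station not in stats:
            stats[station] = [value, value, value, 1]
        else:
            entry = stats[station]
            entry[0] = min(entry[0], value)
            entry[1] = max(entry[1], value)
            entry[2] += value
            entry[3] += 1
    # Sort by station name and round the mean to one decimal place.
    return {
        name: (stats[name][0], round(stats[name][2] / stats[name][3], 1), stats[name][1])
        for name in sorted(stats)
    }


if __name__ == "__main__":
    # Hypothetical sample data, just to show the call.
    sample = ["Hamburg;12.0", "Bulawayo;8.9", "Hamburg;34.2", "Bulawayo;10.1"]
    for station, (low, mean, high) in summarise(sample).items():
        print(f"{station}: min={low} mean={mean} max={high}")
```

A single pass with one small record per station keeps memory proportional to the number of stations rather than the number of rows, which is what makes a billion rows tractable at all. The interesting part of the challenge is how much faster than this naive loop you can go.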
Antora is a modern documentation site generator with many nice features including sourcing documentation content from one or more separate git repositories. This means that your docs can be kept under source control (yay 🎉) and in sync with the code of the product that they are documenting (double yay 🎉🎉).
As you would expect for a documentation tool, the Antora documentation is thorough but there was one sharp edge involving GitHub that caught me out which I’ll detail here.
AI, what a load of hyped-up bollocks, right? Yet here I am, legit writing a blog about it and not for the clickbait but…gasp…because it’s actually useful.
Used correctly, it’s just like any other tool on your desktop. It helps you get stuff done quicker, better—or both.
As a newcomer to Apache Flink one of the first things I did was join the Slack community (which is vendor-neutral and controlled by the Flink PMC). At the moment I’m pretty much in full-time lurker mode, soaking up the kind of questions that people have and how they’re using Flink.
One question that caught my eye was from Marco Villalobos, in which he asked about the Flink JDBC driver and a SQLDataException he was getting with a particular datatype. Now, unfortunately, I have no idea about the answer to this question, but the idea of a JDBC driver through which Flink SQL could be run sounded like a fascinating path to follow after previously looking at the SQL Client.
Sometimes you might want to access Apache Kafka that’s running on your local machine from another device not on the same network. I’m not sure I can think of a production use-case, but there are a dozen examples for sandbox, demo, and playground environments.
In this post we’ll see how you can use ngrok to, in their words, "Put localhost on the internet". And specifically, your local Kafka broker on the internet.
When I started my journey learning Apache Flink one of the things that several people expressed an interest in hearing more about was PyFlink. This appeals to me too, because whilst Java is just something I don’t know and feels beyond me to try and learn, Python is something that I know enough of to at least hack my way around it. I’ve previously had fun with PySpark, and whilst Flink SQL will probably be one of my main focusses, I also want to get a feel for PyFlink.
The first step to using PyFlink is installing it - which should be simple, right?
So far I’ve plotted out a bit of a map for my exploration of Apache Flink, looked at what Flink is, and run my first Flink application. Being an absolutely abysmal coder—but knowing a thing or two about SQL—I figure that Flink SQL is where my focus is going to lie (I’m also intrigued by PyFlink, but that’s for another day…).
🎉 I just ran my first Apache Flink cluster and application on it 🎉
A brief diversion from my journey learning Apache Flink to document an interesting zsh oddity that briefly tripped me up:
cd: string not in pwd: flink-1.17.1
My journey with Apache Flink begins with an overview of what Flink actually is.
What better place to start than the Apache Flink website itself:
Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink has been designed to run in all common cluster environments, perform computations at in-memory speed and at any scale.
Like a fortunate child on Christmas Day, I’ve got a brand new toy! A brand new—to me—open-source technology to unwrap, learn, and perhaps even aspire to master elements of within.
I joined Decodable two weeks ago, and since Decodable is built on top of Apache Flink it seems like a great time to learn it. After six years learning Apache Kafka and hearing about this “Flink” thing but—for better or worse—never investigating it, I now have the perfect opportunity to do so.