Loading delimited data into Kafka - quick & dirty (but effective)
Whilst Apache Kafka is an event streaming platform designed for, well, streams of events, it’s perfectly valid to use it as a store of data which perhaps changes only occasionally (or even never). I’m thinking here of reference data (lookup data) that’s used to enrich regular streams of events.
You might well get your reference data from a database where it resides and do so effectively using CDC - but sometimes it comes down to those pesky CSV files that we all know and love/hate. Simple, awful, but effective. I wrote previously about loading CSV data into Kafka from files that are updated frequently, but here I want to look at CSV files that are not changing. Kafka Connect simplifies getting data into (and out of) Kafka, but even Kafka Connect becomes a bit of an overhead when you just have a single file that you want to load into a topic and then never deal with again. I spent this afternoon wrangling with a couple of CSV-ish files, and building on my previous article about neat tricks you can do in bash with data, I have some more to share with you here :)
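For a one-off load like that, a quick sketch with kcat (formerly kafkacat) can do the job - assuming a broker on localhost:9092; the file and topic names here are just placeholders:

# Skip the CSV header row and write each remaining line to Kafka,
# one message per line.
tail -n +2 reference_data.csv | \
  kcat -b localhost:9092 -t reference_data -P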
📼 ksqlDB HOWTO - A mini video series 📼
Some people learn through doing - and for that there’s a bunch of good ksqlDB tutorials here and here. Others may prefer to watch and listen first, before getting hands on. And for that, I humbly offer you this little series of videos all about ksqlDB. They’re all based on a set of demo scripts that you can run for yourself and try out.
🚨 Make sure you subscribe to my YouTube channel so that you don’t miss more videos like these!
Performing a GROUP BY on data in bash
One of the fun things about working with data over the years is learning how to use the tools of the day - but also learning to fall back on the tools that are always there for you, and one of those is bash and its wonderful library of shell tools.
There’s an even better way than I’ve described here, and it’s called visidata. I’ve written more about it over here.
I’ve been playing around with a new data source recently, and needed to understand more about its structure. Within a single stream there were multiple message types.
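As a quick sketch of what a "GROUP BY" looks like in bash - assuming the messages have been dumped to a local file as one JSON document per line, with a hypothetical type field identifying the message type:

# Count how many messages there are of each type, largest group first.
jq -r '.type' messages.json | sort | uniq -c | sort -rn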
Running as root on Docker images that don’t use root
tl;dr: specify the --user root argument:
docker exec --interactive \
            --tty \
            --user root \
            --workdir / \
            container-name bash
Running a self-managed Kafka Connect worker for Confluent Cloud
Confluent Cloud is not only a fully managed Apache Kafka service, but also provides important additional pieces for building applications and pipelines, including managed connectors, Schema Registry, and ksqlDB. Managed connectors are run for you (hence, managed!) within Confluent Cloud - you just specify the technology that you want to integrate into or out of Kafka, and Confluent Cloud does the rest.
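For technologies that aren’t covered by the managed connectors, you can run your own Kafka Connect worker and point it at your Confluent Cloud cluster. As a rough sketch (not a complete worker configuration - the bootstrap server, API key, and API secret below are placeholders), the key worker properties look something like this:

# Write a minimal distributed-worker config pointing at Confluent Cloud.
cat > connect-worker.properties <<'EOF'
bootstrap.servers=YOUR-CLUSTER.confluent.cloud:9092
security.protocol=SASL_SSL
sasl.mechanism=PLAIN
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="API_KEY" password="API_SECRET";
group.id=self-managed-connect-cluster
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
config.storage.topic=_connect-configs
offset.storage.topic=_connect-offsets
status.storage.topic=_connect-status
config.storage.replication.factor=3
offset.storage.replication.factor=3
status.storage.replication.factor=3
EOF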
Creating topics with Kafka Connect
When Kafka Connect ingests data from a source system into Kafka it writes it to a topic. If you have set auto.create.topics.enable = true on your broker then the topic will be created when it is written to. If auto.create.topics.enable = false (as it is on Confluent Cloud and many self-managed environments, for good reasons) then you can tell Kafka Connect to create those topics first. This was added in Apache Kafka 2.6 (Confluent Platform 6.0); prior to that you had to create the topics yourself manually, otherwise the connector would fail.
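For example, here’s a hedged sketch of adding the topic creation settings to a source connector via the Kafka Connect REST API - the connector class and connection details are just placeholders; the two topic.creation.* properties are the point:

curl -X PUT http://localhost:8083/connectors/my-source-connector/config \
     -H "Content-Type: application/json" \
     -d '{
           "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
           "connection.url": "jdbc:postgresql://postgres:5432/demo",
           "mode": "bulk",
           "topic.prefix": "postgres-",
           "topic.creation.default.partitions": 6,
           "topic.creation.default.replication.factor": 3
         }'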
Kafka Connect - Deep Dive into Single Message Transforms
KIP-66 was added in Apache Kafka 0.10.2 and brought new functionality called Single Message Transforms (SMT). Using SMT you can modify the data and its characteristics as it passes through a Kafka Connect pipeline, without needing additional stream processors. For things like manipulating fields, changing topic names, conditionally dropping messages, and more, SMT are a perfect solution. If you get to things like aggregation, joining streams, and lookups, then SMT may not be the best option for you, and you should head over to Kafka Streams or ksqlDB instead.
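As a quick illustration (a sketch only - the connector, topic, and transform names are placeholders), here is the stock RegexRouter Single Message Transform renaming a topic as data passes through a sink connector:

curl -X PUT http://localhost:8083/connectors/my-sink-connector/config \
     -H "Content-Type: application/json" \
     -d '{
           "connector.class": "org.apache.kafka.connect.file.FileStreamSinkConnector",
           "topics": "orders",
           "file": "/tmp/orders.txt",
           "transforms": "renameTopic",
           "transforms.renameTopic.type": "org.apache.kafka.connect.transforms.RegexRouter",
           "transforms.renameTopic.regex": "orders",
           "transforms.renameTopic.replacement": "orders_v1"
         }'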
🎄 Twelve Days of SMT 🎄 - Day 12: Community Transformations
🎄 Twelve Days of SMT 🎄 - Day 11: Predicate and Filter
Apache Kafka 2.6 included KIP-585, which adds support for defining predicates against which transforms are conditionally executed, as well as a Filter Single Message Transform to drop messages - which in combination means that you can conditionally drop messages.
As part of Apache Kafka, Kafka Connect ships with pre-built Single Message Transforms and Predicates, but you can also write your own. The API for each is documented: Transformation / Predicate. The predicates that ship with Apache Kafka are:
- RecordIsTombstone - The value part of the message is null (denoting a tombstone message)
- HasHeaderKey - Matches if a header exists with the name given
- TopicNameMatches - Matches based on topic name
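To tie predicates and the Filter transform together, here’s a hedged sketch (the connector, topic, transform, and predicate names are placeholders) of TopicNameMatches gating Filter so that messages from a matching topic are dropped:

curl -X PUT http://localhost:8083/connectors/my-sink-connector/config \
     -H "Content-Type: application/json" \
     -d '{
           "connector.class": "org.apache.kafka.connect.file.FileStreamSinkConnector",
           "topics": "orders,orders-test",
           "file": "/tmp/orders.txt",
           "transforms": "dropTestMessages",
           "transforms.dropTestMessages.type": "org.apache.kafka.connect.transforms.Filter",
           "transforms.dropTestMessages.predicate": "isTestTopic",
           "predicates": "isTestTopic",
           "predicates.isTestTopic.type": "org.apache.kafka.connect.transforms.predicates.TopicNameMatches",
           "predicates.isTestTopic.pattern": "orders-test"
         }'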