We previously looked at the background to getting XML into Kafka, and potentially how [not] to do it. Now let’s look at the proper way to build a streaming ingestion pipeline for XML into Kafka, using Kafka Connect.
If you’re unfamiliar with Kafka Connect, check out this quick intro to Kafka Connect here. Kafka Connect’s excellent plugable architecture means that we can pair any source connector to read XML from wherever we have it (for example, a flat file, or a MQ, or anywhere else), with a Single Message Transform to transform the XML into a payload with a schema, and finally a converter to serialise the data in a form that we would like to use such as Avro or Protobuf.
What would a blog post on rmoff.net be if it didn’t include the dirty hack option? 😁
The secret to dirty hacks is that they are often rather effective and when needs must, they can suffice. If you’re prototyping and need to JFDI, a dirty hack is just fine. If you’re looking for code to run in Production, then a dirty hack probably is not fine.
XML has been around for 20+ years, and whilst other ways of serialising our data have gained popularity in more recent times (such as JSON, Avro, and Protobuf), XML is not going away soon. Part of that is down to technical reasons (clearly defined and documented schemas), and part of it is simply down to enterprise inertia - having adopted XML for systems in the last couple of decades, they’re not going to be changing now just for some short-term fad.
One of my favourite hacks for getting data into Kafka is using kafkacat and stdin, often from jq. You can see this in action with Wi-Fi data, IoT data, and data from a REST endpoint. This is fine for getting values into a Kafka message - but Kafka messages are key/value, and being able to specify a key is can often be important.
Here’s a way to do that, using a separator and some jq magic. Note that at the moment kafkacat only supports single byte separator characters, so you need to choose carefully. If you pick a separator that also appears in your data, it’s possibly going to have unintended consequences.
Readers of a certain age and RDBMS background will probably remember northwind, or HR, or OE databases - or quite possibly not just remember them but still be using them. Hardcoded sample data is fine, and it’s great for repeatable tutorials and examples - but it’s boring as heck if you want to build an example with something that isn’t using the same data set for the 100th time.
Prompted by a question on StackOverflow I thought I’d take a quick look at setting up ksqlDB to ingest CDC events from Microsoft SQL Server using Debezium. Some of this is based on my previous article, Streaming data from SQL Server to Kafka to Snowflake ❄️ with Kafka Connect.
Setting up the Docker Compose I like standalone, repeatable, demo code. For that reason I love using Docker Compose and I embed everything in there - connector installation, the kitchen sink - the works.
I use Hugo for my blog, hosted on GitHub pages. One of the reasons I’m really happy with it is that I can use Asciidoc to author my posts. I was writing a blog recently in which I wanted to include some code that’s hosted on GitHub. I could have copied & pasted it into the blog but that would be lame!
With Asciidoc you can use the include:: directive to include both local files:
There’s ways, and then there’s ways, to count the number of records/events/messages in a Kafka topic. Most of them are potentially inaccurate, or inefficient, or both. Here’s one that falls into the potentially inefficient category, using kafkacat to read all the messages and pipe to wc which with the -l will tell you how many lines there are, and since each message is a line, how many messages you have in the Kafka topic:
Google Chrome automagically adds sites that you visit which support searching to a list of custom search engines. For each one you can set a keyword which activates it, so based on the above list if I want to search Amazon I can just type a<tab> and then my search term
At the beginning of all this my aim was to learn something new (Go), and use it to write a version of a utility that I’d previously hacked together in Python that checks your Apache Kafka broker configuration for possible problems with the infamous advertised.listeners setting. Check out a blog that I wrote which explains all about Apache Kafka and listener configuration.
You can find the code at https://github.com/rmoff/kafka-listeners It’s been a fun journey, and now I am pleased to be able to show the results of it.
So far I’ve been running all my code either in the Go Tour sandbox, using Go Playground, or from a single file in VS Code. My explorations in the previous article ended up with a a source file that was starting to get a little bit unwieldily, so let’s take a look at how that can be improved.
Within my most recent code, I have the main function and the doProduce function, which is fine when collapsed down:
Having ticked off the basics with an Apache Kafka producer and consumer in Go, let’s now check out the AdminClient. This is useful for checking out metadata about the cluster, creating topics, and stuff like that.