I've been doing some noodling around with Confluent's Kafka Connect recently, as part of gaining a wider understanding of Kafka. If you're not familiar with Kafka Connect, this page gives a good idea of the thinking behind it.
One issue that I hit defeated my Google-fu so I'm recording it here to hopefully help out fellow n00bs.
The pipeline that I'd set up looked like this:
- Eneco's Twitter Source streaming tweets to a Kafka topic
- Confluent's HDFS Sink to stream the tweets to HDFS and automagically define a Hive table over them (both connector configs are sketched just below)
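For context, the two connector configs looked something like the sketch below. The HDFS sink properties follow Confluent's kafka-connect-hdfs docs; the Twitter source property names are as I recall them from Eneco's kafka-connect-twitter README, so treat hosts, keys, and topic names as placeholders rather than my exact files:

```
# twitter-source.properties - Eneco's Twitter source (placeholder credentials)
name=twitter-source
connector.class=com.eneco.trading.kafka.connect.twitter.TwitterSourceConnector
tasks.max=1
topic=twitter
twitter.consumerkey=YOUR_CONSUMER_KEY
twitter.consumersecret=YOUR_CONSUMER_SECRET
twitter.token=YOUR_ACCESS_TOKEN
twitter.secret=YOUR_ACCESS_SECRET
track.terms=kafka

# hdfs-sink.properties - Confluent's HDFS sink with Hive integration enabled
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics=twitter
hdfs.url=hdfs://localhost:8020
flush.size=100
hive.integration=true
hive.metastore.uris=thrift://localhost:9083
schema.compatibility=BACKWARD
```

Note that enabling `hive.integration` also requires `schema.compatibility` to be set to BACKWARD, FORWARD, or FULL, which is where the BACKWARD in the error below comes from.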
It worked great, but only if I didn't enable the Hive integration part. For me, the integration with Hive to automatically define schemas was one of the key attractions of this platform, so I wanted to see if I could get it to work. The error I got was:
```
org.apache.kafka.connect.errors.SchemaProjectorException: Schema version required for BACKWARD compatibility
```
The long and short of it was that I was using the wrong `Converter` class for the data being written to and read from Kafka: instead of Avro, I'd used JSON. I'd initially based my worker config on `/etc/kafka/connect-standalone.properties`, just copying and pasting examples from the docs, and then gone off on my own config without thinking about it much further. This meant that for `value.converter` I was using `org.apache.kafka.connect.json.JsonConverter` instead of `io.confluent.connect.avro.AvroConverter`. When you think about it, if you want a schema defined in Hive it's got to come from somewhere, and that somewhere is the Schema Registry that ships with the Confluent Platform.
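To make the difference concrete, here's roughly how the converter settings differ between the two worker config files that ship with the Confluent Platform (paraphrased, so check your own copies; the Schema Registry URL assumes the default local port):

```
# /etc/kafka/connect-standalone.properties - what I'd copied: JSON, no Schema Registry
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter

# /etc/schema-registry/connect-avro-standalone.properties - what I needed: Avro + Schema Registry
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://localhost:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081
```

With the Avro converter, every message carries a schema ID registered in the Schema Registry, which is exactly what the HDFS sink needs to project schema versions and build the Hive table.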
Once I switched my config to use `/etc/schema-registry/connect-avro-standalone.properties`, everything worked just perfectly!
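For completeness, launching a standalone Connect worker with the Avro config looks something like this under a Confluent Platform install (the two connector property filenames are just my hypothetical names from the sketch above):

```
$ ./bin/connect-standalone \
    /etc/schema-registry/connect-avro-standalone.properties \
    twitter-source.properties \
    hdfs-sink.properties
```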
You can find the configuration files in a gist here.
(photo credit: https://unsplash.com/@dan_carl5on)