Batch write to Kafka does not observe checkpoint

Posted 2019-08-23 10:51

Follow-up from my previous question: I'm writing a large dataframe in a batch from Databricks to Kafka. This generally works fine now. However, sometimes there are errors (mostly timeouts); retrying kicks in and processing starts over from the beginning. But this does not seem to observe the checkpoint, which results in duplicates being written to the Kafka sink.

So should checkpoints work in batch-writing mode at all? Or am I missing something?

Config:

EH_SASL = 'kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required username="$ConnectionString" password="Endpoint=sb://myeventhub.servicebus.windows.net/;SharedAccessKeyName=RootManageSharedAccessKey;SharedAccessKey=****";'

(dfKafka
    .write
    .format("kafka")
    # Event Hubs' Kafka-compatible endpoint requires SASL_SSL with PLAIN auth
    .option("kafka.sasl.mechanism", "PLAIN")
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.jaas.config", EH_SASL)
    .option("kafka.bootstrap.servers", "myeventhub.servicebus.windows.net:9093")
    .option("topic", "mytopic")
    .option("checkpointLocation", "/mnt/telemetry/cp.txt")
    .save())
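
For comparison, checkpointLocation is documented for the streaming writer, and a one-shot trigger gives it batch-like behavior. Here is a minimal sketch of that variant; the trigger call and checkpoint path are my assumptions, not part of the original job, and it only works if the source is also read as a stream:

# Sketch only: checkpointLocation is honored by the streaming writer, and
# trigger(once=True) processes the available data once and then stops.
# NOTE: this assumes dfKafka was created with spark.readStream; calling
# writeStream on a batch DataFrame raises an error.
(dfKafka
    .writeStream
    .format("kafka")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.jaas.config", EH_SASL)
    .option("kafka.bootstrap.servers", "myeventhub.servicebus.windows.net:9093")
    .option("topic", "mytopic")
    .option("checkpointLocation", "/mnt/telemetry/checkpoints")  # illustrative path
    .trigger(once=True)
    .start()
    .awaitTermination())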

1 Answer
虎瘦雄心在
#2 · 2019-08-23 11:37

Spark checkpoints tend to cause duplicates. Storing and reading offsets from ZooKeeper may solve this issue. Here is a link with the details:

http://aseigneurin.github.io/2016/05/07/spark-kafka-achieving-zero-data-loss.html

Also, in your case, are checkpoints not working at all, or are they causing duplicates? The article above helps with the latter case.
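
A minimal sketch of what the article's approach could look like in PySpark for a batch job: offsets are loaded from ZooKeeper before the read and committed back only after the write to the sink succeeds, so a retried batch resumes instead of reprocessing. The ZooKeeper host, znode path, and helper names are illustrative assumptions, and the kazoo client library is used here:

import json
from kazoo.client import KazooClient   # ZooKeeper client, assumed available
from pyspark.sql import functions as F

ZK_PATH = "/myapp/offsets/mytopic"     # hypothetical znode for this job

zk = KazooClient(hosts="zookeeper:2181")  # illustrative host
zk.start()
zk.ensure_path(ZK_PATH)

def load_starting_offsets():
    # JSON in the shape Spark expects, e.g. {"mytopic": {"0": 23}};
    # fall back to "earliest" on the very first run.
    data, _ = zk.get(ZK_PATH)
    return data.decode("utf-8") if data else "earliest"

def commit_offsets(df):
    # Derive the next starting offsets from the highest offset read per
    # partition (+1, since startingOffsets is inclusive) and persist them.
    rows = df.groupBy("partition").agg(F.max("offset").alias("max_offset")).collect()
    offsets = {"mytopic": {str(r["partition"]): r["max_offset"] + 1 for r in rows}}
    zk.set(ZK_PATH, json.dumps(offsets).encode("utf-8"))

# Batch-read only the records that have not been committed yet.
# 'spark' is the Databricks-provided session; cache so the offset
# aggregation below sees exactly the data that was written.
dfSource = (spark.read
    .format("kafka")
    .option("kafka.bootstrap.servers", "source-broker:9092")  # illustrative
    .option("subscribe", "mytopic")
    .option("startingOffsets", load_starting_offsets())
    .option("endingOffsets", "latest")
    .load()
    .cache())

# ... transform and write to the sink as in the question ...

# Commit only AFTER the sink write succeeded, so a failed or retried
# batch re-reads from the previous position instead of duplicating
# already-delivered records.
commit_offsets(dfSource)

This trades Spark's checkpoint files for an explicit commit step you control, which is essentially what the linked article does on the read side.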
