Batch write from to Kafka does not observe checkpo

2019-08-23 10:51发布

Follow-up from my previous question: I'm writing a large dataframe in a batch from Databricks to Kafka. This generally works fine now. However, some times there are some errors (mostly timeouts). Retrying kicks in and processing will start over again. But this does not seem to observe the checkpoint, which results in duplicates being written to the Kafka sink.

So should checkpoints work in batch-writing mode at all? Or I am missing something?

Config:

EH_SASL = 'kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required username="$ConnectionString" password="Endpoint=sb://myeventhub.servicebus.windows.net/;SharedAccessKeyName=RootManageSharedAccessKey;SharedAccessKey=****";'

dfKafka \
.write  \
.format("kafka") \
.option("kafka.sasl.mechanism", "PLAIN") \
.option("kafka.security.protocol", "SASL_SSL") \
.option("kafka.sasl.jaas.config", EH_SASL) \
.option("kafka.bootstrap.servers", "myeventhub.servicebus.windows.net:9093") \
.option("topic", "mytopic") \
.option("checkpointLocation", "/mnt/telemetry/cp.txt") \
.save()

标签： apache-spark pyspark apache-kafka databricks

1条回答

虎瘦雄心在

2楼-- · 2019-08-23 11:37

Spark checkpoints tend to cause duplicates . Storing and reading Offset from Zookeeper may solve this issue. Here is the link for details :

http://aseigneurin.github.io/2016/05/07/spark-kafka-achieving-zero-data-loss.html

Also, in your case , checkpoints are not working at all or checkpoints are causing duplicates ? Above URL help is for the later case.

0人赞添加讨论(0) 举报

Batch write from to Kafka does not observe checkpo

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间