aliyun / kafka-connect-oss

Kafka Connect suite of connectors for OSS

OSS sink crashes repeatedly when having to overwrite files

dsarosi opened this issue · comments

When the oss-sink crashes, it re-reads the topic from the beginning of the queue and therefore tries to overwrite files that already exist. However, when it queries metadata about those existing files on OSS, it fails with a NullPointerException. Perhaps an update of Hadoop is in order.

[2020-12-18 13:54:47,759] ERROR WorkerSinkTask{id=oss-sink-0} Task threw an uncaught and unrecoverable exception. Task is being killed and will not recover until manually restarted. Error: java.lang.NullPointerException (org.apache.kafka.connect.runtime.WorkerSinkTask)
org.apache.kafka.connect.errors.ConnectException: java.lang.NullPointerException
	at com.aliyun.oss.connect.kafka.storage.OSSStorage.create(OSSStorage.java:88)
	at com.aliyun.oss.connect.kafka.format.json.JsonRecordWriterProvider$1.<init>(JsonRecordWriterProvider.java:64)
	at com.aliyun.oss.connect.kafka.format.json.JsonRecordWriterProvider.getRecordWriter(JsonRecordWriterProvider.java:63)
	at com.aliyun.oss.connect.kafka.format.json.JsonRecordWriterProvider.getRecordWriter(JsonRecordWriterProvider.java:39)
	at com.aliyun.oss.connect.kafka.storage.TopicPartitionWriter.writeRecord(TopicPartitionWriter.java:260)
	at com.aliyun.oss.connect.kafka.storage.TopicPartitionWriter.checkRotationOrAppend(TopicPartitionWriter.java:229)
	at com.aliyun.oss.connect.kafka.storage.TopicPartitionWriter.executeState(TopicPartitionWriter.java:199)
	at com.aliyun.oss.connect.kafka.storage.TopicPartitionWriter.write(TopicPartitionWriter.java:165)
	at com.aliyun.oss.connect.kafka.OSSSinkTask.put(OSSSinkTask.java:173)
	at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:546)
	at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:326)
	at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:228)
	at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:196)
	at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:184)
	at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:234)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
	at org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem.getFileStatus(AliyunOSSFileSystem.java:287)
	at org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem.create(AliyunOSSFileSystem.java:117)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1118)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1098)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:987)
	at com.aliyun.oss.connect.kafka.storage.OSSStorage.create(OSSStorage.java:86)
	... 19 more
[2020-12-18 13:54:47,761] ERROR WorkerSinkTask{id=oss-sink-0} Task threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask)
org.apache.kafka.connect.errors.ConnectException: Exiting WorkerSinkTask due to unrecoverable exception.
	at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:568)
	at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:326)
	at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:228)
	at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:196)
	at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:184)
	at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:234)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.kafka.connect.errors.ConnectException: java.lang.NullPointerException
	at com.aliyun.oss.connect.kafka.storage.OSSStorage.create(OSSStorage.java:88)
	at com.aliyun.oss.connect.kafka.format.json.JsonRecordWriterProvider$1.<init>(JsonRecordWriterProvider.java:64)
	at com.aliyun.oss.connect.kafka.format.json.JsonRecordWriterProvider.getRecordWriter(JsonRecordWriterProvider.java:63)
	at com.aliyun.oss.connect.kafka.format.json.JsonRecordWriterProvider.getRecordWriter(JsonRecordWriterProvider.java:39)
	at com.aliyun.oss.connect.kafka.storage.TopicPartitionWriter.writeRecord(TopicPartitionWriter.java:260)
	at com.aliyun.oss.connect.kafka.storage.TopicPartitionWriter.checkRotationOrAppend(TopicPartitionWriter.java:229)
	at com.aliyun.oss.connect.kafka.storage.TopicPartitionWriter.executeState(TopicPartitionWriter.java:199)
	at com.aliyun.oss.connect.kafka.storage.TopicPartitionWriter.write(TopicPartitionWriter.java:165)
	at com.aliyun.oss.connect.kafka.OSSSinkTask.put(OSSSinkTask.java:173)
	at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:546)
	... 10 more
Caused by: java.lang.NullPointerException
	at org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem.getFileStatus(AliyunOSSFileSystem.java:287)
	at org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem.create(AliyunOSSFileSystem.java:117)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1118)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1098)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:987)
	at com.aliyun.oss.connect.kafka.storage.OSSStorage.create(OSSStorage.java:86)
	... 19 more

@dsarosi Sorry for the late reply. Could you please reproduce this issue and give me the bucket and object name that cause this exception?

@wujinhu It's really difficult to recreate the problem. It only happens when the oss-sink crashes and then restarts. It starts uploading from the beginning of the queue for the topic and tries to upload files that already exist in the bucket, so it crashes over and over again and can't recover. If we flush the Kafka queue for that topic and restart the oss-sink, it runs fine again for days until the next crash. It somehow needs to handle the case where a file already exists on OSS: either replace it or skip the upload.

Weird that it fails on this line.

if (LOG.isDebugEnabled()) {
    LOG.debug("Adding: fi: " + keyPath);
}

Is there something that needs to be configured?

We use the following log level.

curl http://kafka-cluster-cp-kafka-connect:8083/admin/loggers/com.aliyun.oss.connect.kafka.OSSSinkConnector
{"level":"INFO"}

This is our configuration.

{
	"name": "oss-sink",
	"config": {
		"connector.class": "com.aliyun.oss.connect.kafka.OSSSinkConnector",
		"partition.duration.ms": "3600000",
		"flush.size": "40",
		"topics": "kmon__v1alpha__zwisstex_eu__telemetry__json",
		"tasks.max": "4",
		"timezone": "Asia/Shanghai",
		"rotate.interval.ms": "3600000",
		"locale": "US",
		"value.converter.schema.registry.url": "http://kafka-cluster-cp-schema-registry:8081",
		"format.class": "com.aliyun.oss.connect.kafka.format.json.JsonFormat",
		"partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
		"name": "oss-sink",
		"value.converter.schemas.enable": "true",
		"value.converter": "io.confluent.connect.json.JsonSchemaConverter",
		"storage.class": "com.aliyun.oss.connect.kafka.storage.OSSStorage",
		"key.converter": "org.apache.kafka.connect.storage.StringConverter",
		"timestamp.extractor": "Record",
		"path.format": "YYYY-MM-dd-HH",
		"oss.bucket": "staging-kapi-datalake-eu"
	},
	"tasks": [{
		"connector": "oss-sink",
		"task": 0
	}, {
		"connector": "oss-sink",
		"task": 1
	}, {
		"connector": "oss-sink",
		"task": 2
	}, {
		"connector": "oss-sink",
		"task": 3
	}],
	"type": "sink"
}

@dsarosi From the stacktrace above

Caused by: java.lang.NullPointerException
	at org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem.getFileStatus(AliyunOSSFileSystem.java:287)
	at org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem.create(AliyunOSSFileSystem.java:117)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1118)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1098)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:987)
	at com.aliyun.oss.connect.kafka.storage.OSSStorage.create(OSSStorage.java:86)

the NPE is thrown in Hadoop code that tries to get the last-modified time of the object.

So could you please provide the object that caused this exception, and we will continue to investigate this issue.

I have another guess. From the OSS Java SDK code:

public Date getLastModified() {
    return (Date) metadata.get(OSSHeaders.LAST_MODIFIED);
}

The OSSHeaders.LAST_MODIFIED lookup is case-sensitive, so could you please check your network environment and test whether the HTTP headers in the response come back upper case or lower case?
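The suspected failure mode can be sketched as follows. This is a hypothetical illustration, not the SDK's actual internals: if response headers land in a case-sensitive map and a proxy rewrites `Last-Modified` to `last-modified`, the lookup returns null and the Hadoop caller that dereferences the date throws the NPE seen above.

```java
import java.util.Date;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of the suspected failure mode; names here are
// illustrative stand-ins, not the OSS SDK's actual internals.
public class HeaderCaseDemo {
    // Stand-in for OSSHeaders.LAST_MODIFIED
    static final String LAST_MODIFIED = "Last-Modified";

    public static void main(String[] args) {
        // Headers as rewritten by a lowercasing proxy (assumed scenario)
        Map<String, Object> metadata = new HashMap<>();
        metadata.put("last-modified", new Date(0L));

        // Case-sensitive lookup, as in the getLastModified() snippet above
        Date lastModified = (Date) metadata.get(LAST_MODIFIED);
        System.out.println("case-sensitive lookup: " + lastModified); // prints null

        // A case-insensitive map would tolerate the rewritten header
        Map<String, Object> insensitive = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        insensitive.putAll(metadata);
        System.out.println("found with case-insensitive map: "
                + (insensitive.get(LAST_MODIFIED) != null));
    }
}
```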

I get the following via curl.

* TCP_NODELAY set
* Connected to staging-kapi-datalake-eu.oss-eu-central-1.aliyuncs.com (47.254.186.77) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server accepted to use http/1.1
* Server certificate:
*  subject: C=CN; ST=ZheJiang; L=HangZhou; O=Alibaba (China) Technology Co., Ltd.; CN=*.oss-eu-central-1.aliyuncs.com
*  start date: Jan 25 09:21:01 2021 GMT
*  expire date: Feb 26 09:21:01 2022 GMT
*  subjectAltName: host "staging-kapi-datalake-eu.oss-eu-central-1.aliyuncs.com" matched cert's "*.oss-eu-central-1.aliyuncs.com"
*  issuer: C=BE; O=GlobalSign nv-sa; CN=GlobalSign Organization Validation CA - SHA256 - G2
*  SSL certificate verify ok.
> GET /topics/kmon__v1alpha__zwisstex_eu__telemetry__json/2021-01-27-19/kmon__v1alpha__zwisstex_eu__telemetry__json%2B0%2B0001578272.json?Expires=1611824509 HTTP/1.1
> Host: staging-kapi-datalake-eu.oss-eu-central-1.aliyuncs.com
> User-Agent: curl/7.68.0
> Accept: */*
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Server: AliyunOSS
< Date: Thu, 28 Jan 2021 08:57:57 GMT
< Content-Type: application/octet-stream
< Content-Length: 8273
< Connection: keep-alive
< x-oss-request-id: 60127C9541687D333860018E
< Accept-Ranges: bytes
< ETag: "B60762A9B722E31A2F41684BCF710087"
< Last-Modified: Wed, 27 Jan 2021 11:00:21 GMT
< x-oss-object-type: Normal
< x-oss-hash-crc64ecma: 13326951938908773481
< x-oss-storage-class: Standard
< Content-MD5: tgdiqbci4xovQWhLz3EAhw==
< x-oss-server-time: 12

Some environments rewrite HTTP headers (see aliyun/aliyun-oss-java-sdk#266). @dsarosi, do you think we should upgrade the OSS Java SDK version and test whether this issue still exists?

Yes. I think you probably just have to upgrade Hadoop to version 3.2.2; it uses a newer SDK.


OK @dsarosi, I have upgraded the Hadoop and OSS SDK versions. Please give it a try, thanks.

Thank you @wujinhu, we'll upgrade to the latest version and test it out.