aliyun / kafka-connect-oss

Kafka Connect suite of connectors for OSS

OSS sink crashes repeatedly when having to overwrite files

dsarosi opened this issue · comments

When the oss-sink crashes, it re-reads the topic from the beginning of the queue and therefore tries to overwrite files that already exist. However, when it queries metadata about those existing files on OSS, it fails with a NullPointerException. Perhaps an update of Hadoop is in order.

[2020-12-18 13:54:47,759] ERROR WorkerSinkTask{id=oss-sink-0} Task threw an uncaught and unrecoverable exception. Task is being killed and will not recover until manually restarted. Error: java.lang.NullPointerException (org.apache.kafka.connect.runtime.WorkerSinkTask)
org.apache.kafka.connect.errors.ConnectException: java.lang.NullPointerException
	at com.aliyun.oss.connect.kafka.storage.OSSStorage.create(OSSStorage.java:88)
	at com.aliyun.oss.connect.kafka.format.json.JsonRecordWriterProvider$1.<init>(JsonRecordWriterProvider.java:64)
	at com.aliyun.oss.connect.kafka.format.json.JsonRecordWriterProvider.getRecordWriter(JsonRecordWriterProvider.java:63)
	at com.aliyun.oss.connect.kafka.format.json.JsonRecordWriterProvider.getRecordWriter(JsonRecordWriterProvider.java:39)
	at com.aliyun.oss.connect.kafka.storage.TopicPartitionWriter.writeRecord(TopicPartitionWriter.java:260)
	at com.aliyun.oss.connect.kafka.storage.TopicPartitionWriter.checkRotationOrAppend(TopicPartitionWriter.java:229)
	at com.aliyun.oss.connect.kafka.storage.TopicPartitionWriter.executeState(TopicPartitionWriter.java:199)
	at com.aliyun.oss.connect.kafka.storage.TopicPartitionWriter.write(TopicPartitionWriter.java:165)
	at com.aliyun.oss.connect.kafka.OSSSinkTask.put(OSSSinkTask.java:173)
	at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:546)
	at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:326)
	at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:228)
	at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:196)
	at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:184)
	at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:234)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
	at org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem.getFileStatus(AliyunOSSFileSystem.java:287)
	at org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem.create(AliyunOSSFileSystem.java:117)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1118)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1098)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:987)
	at com.aliyun.oss.connect.kafka.storage.OSSStorage.create(OSSStorage.java:86)
	... 19 more
[2020-12-18 13:54:47,761] ERROR WorkerSinkTask{id=oss-sink-0} Task threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask)
org.apache.kafka.connect.errors.ConnectException: Exiting WorkerSinkTask due to unrecoverable exception.
	at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:568)
	at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:326)
	at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:228)
	at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:196)
	at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:184)
	at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:234)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.kafka.connect.errors.ConnectException: java.lang.NullPointerException
	at com.aliyun.oss.connect.kafka.storage.OSSStorage.create(OSSStorage.java:88)
	at com.aliyun.oss.connect.kafka.format.json.JsonRecordWriterProvider$1.<init>(JsonRecordWriterProvider.java:64)
	at com.aliyun.oss.connect.kafka.format.json.JsonRecordWriterProvider.getRecordWriter(JsonRecordWriterProvider.java:63)
	at com.aliyun.oss.connect.kafka.format.json.JsonRecordWriterProvider.getRecordWriter(JsonRecordWriterProvider.java:39)
	at com.aliyun.oss.connect.kafka.storage.TopicPartitionWriter.writeRecord(TopicPartitionWriter.java:260)
	at com.aliyun.oss.connect.kafka.storage.TopicPartitionWriter.checkRotationOrAppend(TopicPartitionWriter.java:229)
	at com.aliyun.oss.connect.kafka.storage.TopicPartitionWriter.executeState(TopicPartitionWriter.java:199)
	at com.aliyun.oss.connect.kafka.storage.TopicPartitionWriter.write(TopicPartitionWriter.java:165)
	at com.aliyun.oss.connect.kafka.OSSSinkTask.put(OSSSinkTask.java:173)
	at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:546)
	... 10 more
Caused by: java.lang.NullPointerException
	at org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem.getFileStatus(AliyunOSSFileSystem.java:287)
	at org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem.create(AliyunOSSFileSystem.java:117)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1118)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1098)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:987)
	at com.aliyun.oss.connect.kafka.storage.OSSStorage.create(OSSStorage.java:86)
	... 19 more

@dsarosi Sorry for the late reply. Could you please reproduce this issue and give me the bucket and object name that cause this exception?

@wujinhu It's really difficult to recreate the problem. It only happens when the oss-sink crashes and then restarts. It starts uploading from the beginning of the queue for the topic and tries to upload files that already exist in the bucket, so it crashes over and over again and can't recover. If we flush the Kafka queue for that topic and restart the oss-sink, it runs fine again for days until the next crash. It somehow needs to handle the case where a file already exists on OSS: either replace it or skip the upload.

Weird that it fails on this line.

if (LOG.isDebugEnabled()) {
    LOG.debug("Adding: fi: " + keyPath);
}

Is there something that needs to be configured?

We use the following log level.

curl http://kafka-cluster-cp-kafka-connect:8083/admin/loggers/com.aliyun.oss.connect.kafka.OSSSinkConnector
{"level":"INFO"}

This is our configuration.

{
	"name": "oss-sink",
	"config": {
		"connector.class": "com.aliyun.oss.connect.kafka.OSSSinkConnector",
		"partition.duration.ms": "3600000",
		"flush.size": "40",
		"topics": "kmon__v1alpha__zwisstex_eu__telemetry__json",
		"tasks.max": "4",
		"timezone": "Asia/Shanghai",
		"rotate.interval.ms": "3600000",
		"locale": "US",
		"value.converter.schema.registry.url": "http://kafka-cluster-cp-schema-registry:8081",
		"format.class": "com.aliyun.oss.connect.kafka.format.json.JsonFormat",
		"partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
		"name": "oss-sink",
		"value.converter.schemas.enable": "true",
		"value.converter": "io.confluent.connect.json.JsonSchemaConverter",
		"storage.class": "com.aliyun.oss.connect.kafka.storage.OSSStorage",
		"key.converter": "org.apache.kafka.connect.storage.StringConverter",
		"timestamp.extractor": "Record",
		"path.format": "YYYY-MM-dd-HH",
		"oss.bucket": "staging-kapi-datalake-eu"
	},
	"tasks": [{
		"connector": "oss-sink",
		"task": 0
	}, {
		"connector": "oss-sink",
		"task": 1
	}, {
		"connector": "oss-sink",
		"task": 2
	}, {
		"connector": "oss-sink",
		"task": 3
	}],
	"type": "sink"
}

@dsarosi From the stacktrace above

Caused by: java.lang.NullPointerException
	at org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem.getFileStatus(AliyunOSSFileSystem.java:287)
	at org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem.create(AliyunOSSFileSystem.java:117)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1118)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1098)
	at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:987)
	at com.aliyun.oss.connect.kafka.storage.OSSStorage.create(OSSStorage.java:86)

the NPE is thrown in Hadoop code that tries to get the last-modified time of the object.

So could you please provide the object that caused this exception, and we will continue to investigate this issue.

I have another guess. From the OSS Java SDK code:

public Date getLastModified() {
    return (Date) metadata.get(OSSHeaders.LAST_MODIFIED);
}

The OSSHeaders.LAST_MODIFIED lookup is case-sensitive, so could you please check your network environment and test whether the HTTP headers in the response come back upper case or lower case?
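The suspected failure mode can be sketched as follows. This is a hypothetical illustration, not the SDK's actual internals: if response headers land in a case-sensitive map and a proxy rewrites `Last-Modified` to `last-modified`, the lookup returns null and the Hadoop caller that dereferences the date throws the NPE seen above.

```java
import java.util.Date;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of the suspected failure mode; names here are
// illustrative stand-ins, not the OSS SDK's actual internals.
public class HeaderCaseDemo {
    // Stand-in for OSSHeaders.LAST_MODIFIED
    static final String LAST_MODIFIED = "Last-Modified";

    public static void main(String[] args) {
        // Headers as rewritten by a lowercasing proxy (assumed scenario)
        Map<String, Object> metadata = new HashMap<>();
        metadata.put("last-modified", new Date(0L));

        // Case-sensitive lookup, as in the getLastModified() snippet above
        Date lastModified = (Date) metadata.get(LAST_MODIFIED);
        System.out.println("case-sensitive lookup: " + lastModified); // prints null

        // A case-insensitive map would tolerate the rewritten header
        Map<String, Object> insensitive = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        insensitive.putAll(metadata);
        System.out.println("found with case-insensitive map: "
                + (insensitive.get(LAST_MODIFIED) != null));
    }
}
```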

I get the following via curl.

* TCP_NODELAY set
* Connected to staging-kapi-datalake-eu.oss-eu-central-1.aliyuncs.com (47.254.186.77) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server accepted to use http/1.1
* Server certificate:
*  subject: C=CN; ST=ZheJiang; L=HangZhou; O=Alibaba (China) Technology Co., Ltd.; CN=*.oss-eu-central-1.aliyuncs.com
*  start date: Jan 25 09:21:01 2021 GMT
*  expire date: Feb 26 09:21:01 2022 GMT
*  subjectAltName: host "staging-kapi-datalake-eu.oss-eu-central-1.aliyuncs.com" matched cert's "*.oss-eu-central-1.aliyuncs.com"
*  issuer: C=BE; O=GlobalSign nv-sa; CN=GlobalSign Organization Validation CA - SHA256 - G2
*  SSL certificate verify ok.
> GET /topics/kmon__v1alpha__zwisstex_eu__telemetry__json/2021-01-27-19/kmon__v1alpha__zwisstex_eu__telemetry__json%2B0%2B0001578272.json?Expires=1611824509 HTTP/1.1
> Host: staging-kapi-datalake-eu.oss-eu-central-1.aliyuncs.com
> User-Agent: curl/7.68.0
> Accept: */*
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Server: AliyunOSS
< Date: Thu, 28 Jan 2021 08:57:57 GMT
< Content-Type: application/octet-stream
< Content-Length: 8273
< Connection: keep-alive
< x-oss-request-id: 60127C9541687D333860018E
< Accept-Ranges: bytes
< ETag: "B60762A9B722E31A2F41684BCF710087"
< Last-Modified: Wed, 27 Jan 2021 11:00:21 GMT
< x-oss-object-type: Normal
< x-oss-hash-crc64ecma: 13326951938908773481
< x-oss-storage-class: Standard
< Content-MD5: tgdiqbci4xovQWhLz3EAhw==
< x-oss-server-time: 12

Some environments rewrite HTTP headers (see aliyun/aliyun-oss-java-sdk#266). @dsarosi, do you think we should upgrade the OSS Java SDK version and test whether this issue still exists?

Yes. I think you probably just have to upgrade Hadoop to version 3.2.2; it uses a newer SDK.


OK @dsarosi, I have upgraded the Hadoop and OSS SDK versions. Please give it a try, thanks.

Thank you @wujinhu, we'll upgrade to the latest version and test it out.