`+` symbol in file name is replaced with a space
damache opened this issue · comments
When consuming data from COS using Spark 2.4 configured with Stocator 1.0.28, some of the files that have a `+` in the file name cannot be found.
Steps to reproduce:
Files are stored in COS in this format: `ibm.platform.metrics.us-south.metric_datapoint.v1+1+0001903108.avro`
- run spark shell: `spark-shell --packages org.apache.spark:spark-avro_2.11:2.4.0,com.ibm.stocator:stocator:1.0.28`
- load the data: `val df = spark.read.format("avro").load("cos://managed-apps-drop-zones.mycos/metrics-v3/*")`
- run any command against the data: `df.count()`
Additional code showing this is an issue in the COS SDK: cos-code-snippet.txt
Logs from local Spark: avro_load.log
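For reference, the `+`-to-space mangling matches `application/x-www-form-urlencoded` decoding, where a bare `+` decodes to a space. A minimal, self-contained illustration (the class name is just for the demo):

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

public class PlusDecodeDemo {
    public static void main(String[] args) throws Exception {
        String key = "ibm.platform.metrics.us-south.metric_datapoint.v1+1+0001903108.avro";
        // form-urlencoded decoding turns each bare '+' into a space
        String decoded = URLDecoder.decode(key, StandardCharsets.UTF_8.name());
        System.out.println(decoded);
        // -> ibm.platform.metrics.us-south.metric_datapoint.v1 1 0001903108.avro
    }
}
```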
This was a known issue that was resolved in COS SDK 2.1.1 and later. Are you using the latest version of Stocator with the updates? See Stocator issue: CODAIT/stocator#171
@hegdehr we are using the latest COS SDK. I reproduced the issue with the COS SDK without using Stocator, and with UTF-8 encoding set.
@hegdehr I do find it a bit weird... we had this issue in the past and it was resolved in #171. Now we have the same issue again, and it's pure COS SDK; I reproduced it easily without Stocator.
Thanks @gilv we will take a look
@barry-hueston-IBM
Gil's code is attached in the description. cos-code-snippet.txt
@gilv I'm having trouble recreating the issue. I created a new bucket, added 4 files (some with `+` in the name), and used the snippet below to output their names:
```java
String bucketName = "mynewbucketgit141";
ListObjectsRequest request = new ListObjectsRequest();
request.setBucketName(bucketName);
request.setMaxKeys(5000);
request.withEncodingType("UTF-8");

ObjectListing objectList = s3Client.listObjects(request);
List<S3ObjectSummary> objectSummaries = objectList.getObjectSummaries();
for (S3ObjectSummary obj : objectSummaries) {
    System.out.println("obj.getKey:" + obj.getKey());
}
```
When the encoding type is set, the output is:
obj.getKey:1+1 obj.getKey:12 obj.getKey:2+2+2 obj.getKey:223
When I comment out the encoding type, I get:
obj.getKey:1 1 obj.getKey:12 obj.getKey:2 2 2 obj.getKey:223
Are there any steps I'm missing here to reproduce?
@smcgrath-IBM if you read the logs I posted you will see that we have a lot of data in COS, around 3.5 million rows. I'm not sure how many objects those rows span (we are working on optimizing this with larger objects), but the objects range from 15 KB to 35 KB.
The logs show that several requests are made to handle all of the objects.
When this line is run: `val df = spark.read.format("avro").load("cos://managed-apps-drop-zones.mycos/metrics-v3")`
at least 3 requests are made to get all the objects.
The first request completes and the objects are cached with the `+` in the filename.
The second request for more data returns, but those objects are cached without the `+`. The logs show this.
When `df.count()` runs and Spark actually tries to load the objects, the names that were cached without the `+` are not found.
Let me pull out the sections from the logs that show this.
I parsed the logs.
This starts the request: `val df = spark.read.format("avro").load("cos://managed-apps-drop-zones.mycos/metrics-v3/*")`
1. Initial request to get the objects: initial-request.log
2. The initial request returns a subset of the data and Stocator processes it. In the logs from Stocator you can see the objects all have the `+`: stocator-process-initial-request.log
3. Second request to get more objects: second-request.log
4. Another subset of objects is returned. The log statements from Stocator now show the objects without the `+`: stocator-process-second-request.log
5. Another request: third-request.log
6. Same as step 4 above, no `+`: stocator-process-third-request.log
This continues until all the files are discovered.
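The page-by-page difference in the logs can be modeled without COS at all, assuming the client always applies form-urlencoded decoding to listed keys but the server only percent-encodes them when the request carried `encoding-type` (the class and method names here are hypothetical, for illustration only):

```java
import java.net.URLDecoder;
import java.net.URLEncoder;

public class PagedListingModel {
    // Simulate what the client ends up with for one listing page.
    // encodingTypeSent: whether this particular request carried encoding-type=url.
    static String keyAsSeenByClient(String storedKey, boolean encodingTypeSent) throws Exception {
        // server side: percent-encode the key only when encoding-type was requested
        String onTheWire = encodingTypeSent ? URLEncoder.encode(storedKey, "UTF-8") : storedKey;
        // client side: decoding is applied unconditionally
        return URLDecoder.decode(onTheWire, "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        String storedKey = "metric_datapoint.v1+5+0001057542.avro";
        System.out.println(keyAsSeenByClient(storedKey, true));   // first page: '+' survives
        System.out.println(keyAsSeenByClient(storedKey, false));  // later pages: '+' becomes ' '
    }
}
```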
NOTE: the URL in the load method of Spark had to include `/*` at the end to force loading of the files. If I ran the load with no trailing path delimiter and wildcard, the data was not found. Running any command after this resulted in the files whose cached names lost the `+` being excluded. No idea how or why; using the `/*` masked the problem.
@smcgrath-IBM I failed to read the following object:
metrics-v3/ibm.platform.metrics.us-south.metric_datapoint.v1/timestamp=2019-02-11 00:30/ibm.platform.metrics.us-south.metric_datapoint.v1+5+0001057542.avro
@smcgrath-IBM I am not sure I follow you... about 6 months ago your team suggested using `request.withEncodingType("UTF-8");` in order to support the `+` sign (CODAIT/stocator#171).
Now you suggest removing it?
@gilv I can list that object, see below. I haven't made any suggestions yet, still investigating. Can you send me a stack trace of the error you get attempting to read the object? Are you making a direct call to S3 outside Stocator?
obj.getKey:1+1 obj.getKey:12 obj.getKey:2+2+2 obj.getKey:223 obj.getKey:333++444 obj.getKey:555 444 33 obj.getKey:metrics-v3/ibm.platform.metrics.us-south.metric_datapoint.v1/timestamp=2019-02-11 00:30/ibm.platform.metrics.us-south.metric_datapoint.v1+5+0001057542.avro
@smcgrath-IBM I provided you Java code without any Stocator usage. Please coordinate with @damache; maybe she can provide you credentials offline and you can try. I have already explained that it's not a Stocator issue and it can be reproduced easily.
@smcgrath-IBM I provided all the code to @barry-hueston-IBM. Please coordinate with him.
@smcgrath-IBM @barry-hueston-IBM I just used the code I provided you and ran the list again. Objects are returned without the `+` sign. So please coordinate with @damache.
Then you will observe responses like this:
metrics-v3/ibm.platform.metrics.us-south.metric_datapoint.v1/timestamp=2019-02-11 00:30/ibm.platform.metrics.us-south.metric_datapoint.v1 3 0004659419.avro
metrics-v3/ibm.platform.metrics.us-south.metric_datapoint.v1/timestamp=2019-02-11 00:30/ibm.platform.metrics.us-south.metric_datapoint.v1 4 0002196820.avro
metrics-v3/ibm.platform.metrics.us-south.metric_datapoint.v1/timestamp=2019-02-11 00:30/ibm.platform.metrics.us-south.metric_datapoint.v1 5 0001057542.avro
metrics-v3/ibm.platform.metrics.us-south.metric_datapoint.v1/timestamp=2019-02-11 00:30/ibm.platform.metrics.us-south.metric_datapoint.v1 6 0001986412.avro
metrics-v3/ibm.platform.metrics.us-south.metric_datapoint.v1/timestamp=2019-02-11 00:30/ibm.platform.metrics.us-south.metric_datapoint.v1 7 0002156080.avro
1.0.28. I just looked on Maven and there are two newer versions. I can change to a new version and run it now.
I ran it with 1.0.30 and still get the FileNotFound error:
19/02/15 06:13:31 DEBUG COSInputStream: Stream managed-apps-drop-zones/metrics-v3/ibm.platform.metrics.us-south.metric_datapoint.v1/timestamp=2019-01-29 22:15/ibm.platform.metrics.us-south.metric_datapoint.v1+3+0004028947.avro aborted: seekInStream(); remaining=72286 streamPos=8196, nextReadPos=0, request range 0-80482 length=80482
19/02/15 06:13:31 DEBUG COSInputStream: reopen(managed-apps-drop-zones/metrics-v3/ibm.platform.metrics.us-south.metric_datapoint.v1/timestamp=2019-01-29 22:15/ibm.platform.metrics.us-south.metric_datapoint.v1+3+0004028947.avro) for read from new offset range[0-80482], length=16, streamPosition=0, nextReadPosition=0
19/02/15 06:13:31 DEBUG COSAPIClient: Not found cos://managed-apps-drop-zones.mycos/metrics-v3/ibm.platform.metrics.us-south.metric_datapoint.v1/timestamp=2019-02-05 21:45/ibm.platform.metrics.us-south.metric_datapoint.v1 3 0004427459.avro. Throw FNF exception
As for the version of the COS SDK, Maven doesn't print it on the console. Is it packaged with Stocator? In my local .m2 I have ibm-cos-java-sdk-bundle 2.4.1.
I deleted the COS .m2 cache and reran Spark, and the COS SDK didn't get reloaded. I looked at the Stocator POM file and I don't see COS listed in there. I haven't worked with Java in a few years, so maybe I'm missing something. Ivy doesn't have the jar files either.
From the 1.0.31-SNAPSHOT build I see on GitHub, it is using `<amazon.sdk.version>1.11.59</amazon.sdk.version>`.
@smcgrath-IBM I am not sure why you keep asking about Stocator, while I keep saying that I provided you Java code that uses the COS SDK without any Stocator... Did you see the code I provided?
@smcgrath-IBM Did you use the code I provided you?
@smcgrath-IBM
I have Java code; using @damache's account, it shows the wrong listing.
I provided that code to you.
You claim it works for you? The same exact code??
@gilv I extracted out the listObjects call initially; it wouldn't replicate. Actually running your code relies on Hadoop libraries and configuration files. Please drop in code that I can run using solely the COS SDK, replacing accessKey and secretKey. Then I'll investigate further.
@smcgrath-IBM that code uses a Hadoop Configuration dictionary... just remove it and write a String value with the accessKey instead. Please adapt the code, it's very simple.
@gilv your code is not setting the encoding type when retrieving subsequent listings from the bucket; that is the reason I could not replicate with the listObjects method. You need to set it on the objectList:
```java
if (isTruncated) {
    // re-apply the encoding type before fetching the next page
    objectList.setEncodingType("UTF-8");
    objectList = mClient.listNextBatchOfObjects(objectList);
    objectSummaries = objectList.getObjectSummaries();
} else {
    objectScanContinue = false;
}
```
Shouldn't the original request encoding cascade down into the results it returns?
@smcgrath-IBM why is it not enough to set `request.withEncodingType("UTF-8");`?
Also, why did it work in the past?
@gilv there may be a bug on the S3 API side; I can't see it returning the encoding type field in the HTTP XML body, which it should. I'll need to look into that piece further.
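That would explain it: when `encoding-type=url` is honored, a `+` should travel as `%2B` and round-trip cleanly, whereas a raw `+` gets form-decoded into a space on the client side. A quick self-contained check of the round trip (the class name is just for the demo):

```java
import java.net.URLDecoder;
import java.net.URLEncoder;

public class EncodingRoundTrip {
    public static void main(String[] args) throws Exception {
        String key = "2+2+2";
        // server-side percent-encoding when encoding-type is honored
        String wire = URLEncoder.encode(key, "UTF-8");
        System.out.println(wire);                              // 2%2B2%2B2
        System.out.println(URLDecoder.decode(wire, "UTF-8"));  // 2+2+2 (correct)
        // raw key form-decoded by the client
        System.out.println(URLDecoder.decode(key, "UTF-8"));   // 2 2 2 (corrupted)
    }
}
```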
@smcgrath-IBM @damache good... then we eventually managed to reproduce the issue, and the workaround is to set UTF-8 on each listing, even when it was already provided at the request level.
@damache CODAIT/stocator@687e44d
I modified Stocator; it now returns "+" correctly. Will make a new Stocator release and update you once I am done.
A ticket (CSAFE-50949) has been created against the S3 API for this issue. I'm closing this issue off as it has been raised against the Java SDK.