`+` symbol in file name is replaced with a space
damache opened this issue · comments
When consuming data from COS using Spark 2.4 configured with Stocator 1.0.28, some of the files that have a `+` in the file name cannot be found.
Steps to reproduce:
Files are stored in COS in this format: `ibm.platform.metrics.us-south.metric_datapoint.v1+1+0001903108.avro`
- run spark shell: `spark-shell --packages org.apache.spark:spark-avro_2.11:2.4.0,com.ibm.stocator:stocator:1.0.28`
- load the data: `val df = spark.read.format("avro").load("cos://managed-apps-drop-zones.mycos/metrics-v3/*")`
- run any command against the data: `df.count()`
Additional code showing this is an issue in the COS SDK: cos-code-snippet.txt
Logs from local Spark: avro_load.log
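For reference, the `+`-to-space mangling matches `application/x-www-form-urlencoded` decoding, where a bare `+` decodes to a space. A minimal, self-contained illustration (the class name is just for the demo):

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

public class PlusDecodeDemo {
    public static void main(String[] args) throws Exception {
        String key = "ibm.platform.metrics.us-south.metric_datapoint.v1+1+0001903108.avro";
        // form-urlencoded decoding turns each bare '+' into a space
        String decoded = URLDecoder.decode(key, StandardCharsets.UTF_8.name());
        System.out.println(decoded);
        // -> ibm.platform.metrics.us-south.metric_datapoint.v1 1 0001903108.avro
    }
}
```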
This was a known issue that was resolved in COS SDK 2.1.1 and later. Are you using the latest version of Stocator with the updates? See Stocator issue: CODAIT/stocator#171
@hegdehr we are using the latest COS SDK. I reproduced the issue with the COS SDK without using Stocator, and with UTF-8 encoding set.
@hegdehr I do find it a bit weird... we had this issue in the past and it was resolved in #171. Now we have the same issue again, and it's pure COS SDK; I reproduced it easily without Stocator.
Thanks @gilv we will take a look
@barry-hueston-IBM
Gil's code is attached in the description. cos-code-snippet.txt
@gilv I'm having trouble recreating the issue. I created a new bucket, added 4 files (some with `+` in the name), and used the snippet below to output their names:
```java
String bucketName = "mynewbucketgit141";
ListObjectsRequest request = new ListObjectsRequest();
request.setBucketName(bucketName);
request.setMaxKeys(5000);
request.withEncodingType("UTF-8");

ObjectListing objectList = s3Client.listObjects(request);
List<S3ObjectSummary> objectSummaries = objectList.getObjectSummaries();
for (S3ObjectSummary obj : objectSummaries) {
    System.out.println("obj.getKey:" + obj.getKey());
}
```
When the encoding type is set, the output is:
obj.getKey:1+1 obj.getKey:12 obj.getKey:2+2+2 obj.getKey:223
When I comment out the encoding type, I get:
obj.getKey:1 1 obj.getKey:12 obj.getKey:2 2 2 obj.getKey:223
Are there any steps I'm missing here to reproduce?
@smcgrath-IBM if you read the logs I posted you will see that we have a lot of data in COS, around 3.5 million rows. I'm not sure how many objects those rows span (we are working on optimizing this with larger objects), but the objects range from 15 KB to 35 KB.
The logs show that several requests are made to handle all of the objects.
When this line is run: `val df = spark.read.format("avro").load("cos://managed-apps-drop-zones.mycos/metrics-v3")`
at least 3 requests are made to get all the objects.
The first request completes and the objects are cached with the `+` in the filename.
The second request for more data returns, but those objects are cached without the `+`. The logs show this.
When `df.count()` runs and Spark actually tries to load the objects, the names that were cached without the `+` are not found.
Let me pull out the sections from the logs that show this.
I parsed the logs.
This starts the request: `val df = spark.read.format("avro").load("cos://managed-apps-drop-zones.mycos/metrics-v3/*")`
1. Initial request to get the objects: initial-request.log
2. The initial request returns a subset of the data and Stocator processes it. In the logs from Stocator you can see the objects all have the `+`: stocator-process-initial-request.log
3. Second request to get more objects: second-request.log
4. Another subset of objects is returned. The log statements from Stocator now show the objects without the `+`: stocator-process-second-request.log
5. Another request: third-request.log
6. Same as step 4 above, no `+`: stocator-process-third-request.log
This continues until all the files are discovered.
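The page-by-page difference in the logs can be modeled without COS at all, assuming the client always applies form-urlencoded decoding to listed keys but the server only percent-encodes them when the request carried `encoding-type` (the class and method names here are hypothetical, for illustration only):

```java
import java.net.URLDecoder;
import java.net.URLEncoder;

public class PagedListingModel {
    // Simulate what the client ends up with for one listing page.
    // encodingTypeSent: whether this particular request carried encoding-type=url.
    static String keyAsSeenByClient(String storedKey, boolean encodingTypeSent) throws Exception {
        // server side: percent-encode the key only when encoding-type was requested
        String onTheWire = encodingTypeSent ? URLEncoder.encode(storedKey, "UTF-8") : storedKey;
        // client side: decoding is applied unconditionally
        return URLDecoder.decode(onTheWire, "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        String storedKey = "metric_datapoint.v1+5+0001057542.avro";
        System.out.println(keyAsSeenByClient(storedKey, true));   // first page: '+' survives
        System.out.println(keyAsSeenByClient(storedKey, false));  // later pages: '+' becomes ' '
    }
}
```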
NOTE: the URL in the load method of Spark had to include `/*` at the end to force loading of the files. If I ran the load with no trailing path delimiter and wildcard, the data was not found. Running any command after this resulted in the files whose cached names lost the `+` being excluded. No idea how or why; using the `/*` masked the problem.
@smcgrath-IBM I failed to read the following object:
metrics-v3/ibm.platform.metrics.us-south.metric_datapoint.v1/timestamp=2019-02-11 00:30/ibm.platform.metrics.us-south.metric_datapoint.v1+5+0001057542.avro
@smcgrath-IBM I am not sure I follow you... about 6 months ago your team suggested using `request.withEncodingType("UTF-8");` in order to support the `+` sign (CODAIT/stocator#171).
Now you suggest removing it?
@gilv I can list that object, see below. I haven't made any suggestions yet, still investigating. Can you send me a stack trace of the error you get attempting to read the object? Are you making a direct call to S3 outside Stocator?
obj.getKey:1+1 obj.getKey:12 obj.getKey:2+2+2 obj.getKey:223 obj.getKey:333++444 obj.getKey:555 444 33 obj.getKey:metrics-v3/ibm.platform.metrics.us-south.metric_datapoint.v1/timestamp=2019-02-11 00:30/ibm.platform.metrics.us-south.metric_datapoint.v1+5+0001057542.avro
@smcgrath-IBM I provided you Java code without any Stocator usage. Please coordinate with @damache; maybe she can provide you credentials offline and you can try. I have already explained that it's not a Stocator issue and it can be reproduced easily.
@smcgrath-IBM I provided all the code to @barry-hueston-IBM. Please coordinate with him.
@smcgrath-IBM @barry-hueston-IBM I just used the code I provided you and ran the list again. Objects are returned without the `+` sign. So please coordinate with @damache.
Then you will observe responses like this:
metrics-v3/ibm.platform.metrics.us-south.metric_datapoint.v1/timestamp=2019-02-11 00:30/ibm.platform.metrics.us-south.metric_datapoint.v1 3 0004659419.avro
metrics-v3/ibm.platform.metrics.us-south.metric_datapoint.v1/timestamp=2019-02-11 00:30/ibm.platform.metrics.us-south.metric_datapoint.v1 4 0002196820.avro
metrics-v3/ibm.platform.metrics.us-south.metric_datapoint.v1/timestamp=2019-02-11 00:30/ibm.platform.metrics.us-south.metric_datapoint.v1 5 0001057542.avro
metrics-v3/ibm.platform.metrics.us-south.metric_datapoint.v1/timestamp=2019-02-11 00:30/ibm.platform.metrics.us-south.metric_datapoint.v1 6 0001986412.avro
metrics-v3/ibm.platform.metrics.us-south.metric_datapoint.v1/timestamp=2019-02-11 00:30/ibm.platform.metrics.us-south.metric_datapoint.v1 7 0002156080.avro
1.0.28. I just looked on Maven and there are two newer versions. I can change to a new version and run it now.
I ran it with 1.0.30 and still get the FileNotFound error:
19/02/15 06:13:31 DEBUG COSInputStream: Stream managed-apps-drop-zones/metrics-v3/ibm.platform.metrics.us-south.metric_datapoint.v1/timestamp=2019-01-29 22:15/ibm.platform.metrics.us-south.metric_datapoint.v1+3+0004028947.avro aborted: seekInStream(); remaining=72286 streamPos=8196, nextReadPos=0, request range 0-80482 length=80482
19/02/15 06:13:31 DEBUG COSInputStream: reopen(managed-apps-drop-zones/metrics-v3/ibm.platform.metrics.us-south.metric_datapoint.v1/timestamp=2019-01-29 22:15/ibm.platform.metrics.us-south.metric_datapoint.v1+3+0004028947.avro) for read from new offset range[0-80482], length=16, streamPosition=0, nextReadPosition=0
19/02/15 06:13:31 DEBUG COSAPIClient: Not found cos://managed-apps-drop-zones.mycos/metrics-v3/ibm.platform.metrics.us-south.metric_datapoint.v1/timestamp=2019-02-05 21:45/ibm.platform.metrics.us-south.metric_datapoint.v1 3 0004427459.avro. Throw FNF exception
As for the version of the COS SDK, Maven doesn't print it on the console. Is it packaged with Stocator? In my local .m2 I have ibm-cos-java-sdk-bundle 2.4.1.
I deleted the COS .m2 cache and reran Spark, and the COS SDK didn't get reloaded. I looked at the Stocator POM file and I don't see COS listed in there. I haven't worked with Java in a few years, so maybe I'm missing something. Ivy doesn't have the jar files either.
From the 1.0.31-SNAPSHOT build I see on GitHub, it is using `<amazon.sdk.version>1.11.59</amazon.sdk.version>`.
@smcgrath-IBM I am not sure why you keep asking about Stocator, while I keep saying that I provided you Java code that uses the COS SDK without any Stocator... Did you see the code I provided?
@smcgrath-IBM Did you use the code I provided you?
@smcgrath-IBM
I have Java code; using @damache's account, it shows the wrong listing.
I provided that code to you.
You claim it works for you? The same exact code??
@gilv I extracted out the listObjects call initially; it wouldn't replicate. Actually running your code relies on Hadoop libraries and configuration files. Please drop in code that I can run using solely the COS SDK, replacing accessKey and secretKey. Then I'll investigate further.
@smcgrath-IBM that code uses a Hadoop Configuration dictionary... just remove it and write a String value with the accessKey instead. Please adapt the code, it's very simple.
@gilv your code is not setting the encoding type when retrieving subsequent listings from the bucket; that is the reason I could not replicate with the listObjects method. You need to set it on the objectList:
```java
if (isTruncated) {
    // re-apply the encoding type before fetching the next page
    objectList.setEncodingType("UTF-8");
    objectList = mClient.listNextBatchOfObjects(objectList);
    objectSummaries = objectList.getObjectSummaries();
} else {
    objectScanContinue = false;
}
```
Shouldn't the original request encoding cascade down into the results it returns?
@smcgrath-IBM why is it not enough to set `request.withEncodingType("UTF-8");`?
Also, why did it work in the past?
@gilv there may be a bug on the S3 API side; I can't see it returning the encoding type field in the HTTP XML body, which it should. I'll need to look into that piece further.
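That would explain it: when `encoding-type=url` is honored, a `+` should travel as `%2B` and round-trip cleanly, whereas a raw `+` gets form-decoded into a space on the client side. A quick self-contained check of the round trip (the class name is just for the demo):

```java
import java.net.URLDecoder;
import java.net.URLEncoder;

public class EncodingRoundTrip {
    public static void main(String[] args) throws Exception {
        String key = "2+2+2";
        // server-side percent-encoding when encoding-type is honored
        String wire = URLEncoder.encode(key, "UTF-8");
        System.out.println(wire);                              // 2%2B2%2B2
        System.out.println(URLDecoder.decode(wire, "UTF-8"));  // 2+2+2 (correct)
        // raw key form-decoded by the client
        System.out.println(URLDecoder.decode(key, "UTF-8"));   // 2 2 2 (corrupted)
    }
}
```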
@smcgrath-IBM @damache good... then we eventually managed to reproduce the issue, and the workaround is to set UTF-8 on each listing, even when it was already provided at the request level.
@damache CODAIT/stocator@687e44d
I modified Stocator; it now returns "+" correctly. Will make a new Stocator release and update you once I am done.
A ticket (CSAFE-50949) has been created against the S3 API for this issue. I'm closing this issue off as it has been raised against the Java SDK.