IBM / ibm-cos-sdk-java

`+` symbol in file name is replaced with a space

damache opened this issue · comments

When consuming data from COS using Spark 2.4 configured with Stocator 1.0.28, some of the files that have a + in the file name cannot be found.

Steps to reproduce:

Files are stored in COS in this format: ibm.platform.metrics.us-south.metric_datapoint.v1+1+0001903108.avro

  1. Run the Spark shell: `spark-shell --packages org.apache.spark:spark-avro_2.11:2.4.0,com.ibm.stocator:stocator:1.0.28`
  2. Load the data: `val df = spark.read.format("avro").load("cos://managed-apps-drop-zones.mycos/metrics-v3/*")`
  3. Run any command against the data: `df.count()`

Additional code showing this is an issue in the COS SDK:
cos-code-snippet.txt

Logs from local Spark:
avro_load.log

This was a known issue that was resolved in COS SDK 2.1.1 and later. Are you using the latest version of Stocator with the updates? See Stocator issue: CODAIT/stocator#171

@hegdehr we are using the latest COS SDK. I reproduced the issue with the COS SDK without using Stocator, also using UTF-8 encoding.

@hegdehr I do find it a bit weird... we had this issue in the past and it was resolved as in #171. Now we have the same issue again, and it's pure COS SDK. I reproduced it easily without Stocator.

Thanks @gilv we will take a look

@barry-hueston-IBM

Gil's code is attached in the description. cos-code-snippet.txt

@gilv I'm having trouble recreating the issue. I created a new bucket, added 4 files (some with + in the name), and used the snippet below to output their names:
```java
import java.util.List;

import com.ibm.cloud.objectstorage.services.s3.model.ListObjectsRequest;
import com.ibm.cloud.objectstorage.services.s3.model.ObjectListing;
import com.ibm.cloud.objectstorage.services.s3.model.S3ObjectSummary;

// s3Client is an AmazonS3 client already configured with COS credentials and endpoint
String bucketName = "mynewbucketgit141";

// list up to 5000 keys with the encoding type set on the request
ListObjectsRequest request = new ListObjectsRequest();
request.setBucketName(bucketName);
request.setMaxKeys(5000);
request.withEncodingType("UTF-8");

ObjectListing objectList = s3Client.listObjects(request);
List<S3ObjectSummary> objectSummaries = objectList.getObjectSummaries();

for (S3ObjectSummary obj : objectSummaries) {
    System.out.println("obj.getKey:" + obj.getKey());
}
```

When encoding is set, the output is:

obj.getKey:1+1
obj.getKey:12
obj.getKey:2+2+2
obj.getKey:223

When I comment out the encoding I get:

obj.getKey:1 1
obj.getKey:12
obj.getKey:2 2 2
obj.getKey:223

Are there any steps I'm missing here to reproduce?

@smcgrath-IBM if you read the logs I posted you will see that we have a lot of data in COS, around 3.5 million rows. I'm not sure how many objects those rows span (we are working on optimizing this with larger objects), but the objects range from 15 KB to 35 KB.

The logs show there are several requests made to handle all of the objects.

When this line is run: `val df = spark.read.format("avro").load("cos://managed-apps-drop-zones.mycos/metrics-v3")`

there are at least 3 requests made to get all the objects.

The first request is completed and the objects are cached with the + in the filename.
The second request for more data is returned, but the objects are cached without the +.

The logs show this.

When `df.count()` runs and Spark actually tries to load the objects, the names that were cached without the + are not found.

Let me pull out the sections from the logs that show this.

I parsed the logs.
This starts the request: `val df = spark.read.format("avro").load("cos://managed-apps-drop-zones.mycos/metrics-v3/*")`

  1. Initial request to get the objects:
    initial-request.log

  2. The initial request returns a subset of the data and Stocator processes it. In the logs from Stocator you can see the objects all have +:
    stocator-process-initial-request.log

  3. Second request to get more objects:
    second-request.log

  4. Another subset of objects is returned. The log statements from Stocator now show the objects without the +:
    stocator-process-second-request.log

  5. Another request:
    third-request.log

  6. Same as 4 above, no +:
    stocator-process-third-request.log

This continues until all the files are discovered (roughly sketched in code below).
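To make steps 1-6 concrete, this is roughly the listing pattern the logs correspond to. This is my reconstruction, not the actual Stocator code; `s3Client` is assumed to be configured as in the earlier snippet, and the bucket/prefix are illustrative:

```java
// Reconstruction of the paginated listing flow described in steps 1-6.
// The encoding type is set on the first request only; pages fetched via
// listNextBatchOfObjects() do not carry it over, which matches the logs:
// first-page keys keep the '+', later pages come back with spaces.
ListObjectsRequest request = new ListObjectsRequest()
        .withBucketName("managed-apps-drop-zones")
        .withPrefix("metrics-v3/")
        .withEncodingType("UTF-8");

ObjectListing listing = s3Client.listObjects(request);
while (true) {
    for (S3ObjectSummary obj : listing.getObjectSummaries()) {
        System.out.println(obj.getKey());
    }
    if (!listing.isTruncated()) {
        break;
    }
    // builds the next request from the listing; the encoding type is lost here
    listing = s3Client.listNextBatchOfObjects(listing);
}
```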

NOTE: the URL in Spark's load method had to include /* at the end to force loading of the files. If I ran the load with no trailing path delimiter and wildcard, the data was not found.

Running any command after this results in the files cached without the + being excluded. No idea how or why; using the /* masked the problem.

@smcgrath-IBM I failed to read the following object:

metrics-v3/ibm.platform.metrics.us-south.metric_datapoint.v1/timestamp=2019-02-11 00:30/ibm.platform.metrics.us-south.metric_datapoint.v1+5+0001057542.avro

@smcgrath-IBM I am not sure I follow you... about 6 months ago your team suggested using `request.withEncodingType("UTF-8");` in order to support the "+" sign (CODAIT/stocator#171).
Now you suggest removing it?

@gilv I can list that object, see below. I haven't made any suggestions yet, still investigating. Can you send me a stacktrace of the error you get attempting to read the object? Are you doing a direct call on S3 outside Stocator?

obj.getKey:1+1
obj.getKey:12
obj.getKey:2+2+2
obj.getKey:223
obj.getKey:333++444
obj.getKey:555 444 33
obj.getKey:metrics-v3/ibm.platform.metrics.us-south.metric_datapoint.v1/timestamp=2019-02-11 00:30/ibm.platform.metrics.us-south.metric_datapoint.v1+5+0001057542.avro

@smcgrath-IBM I provided you Java code without any Stocator usage. Please coordinate with @damache. Maybe she can provide you credentials offline and you can try. I already explained that it's not a Stocator issue and it can be reproduced easily.

@smcgrath-IBM I provided all the code to @barry-hueston-IBM. Please coordinate with him.

@smcgrath-IBM @barry-hueston-IBM I just used the code I provided you and ran the list again. Objects were returned without the "+" sign. So please coordinate with @damache.

Then you will observe responses like this:

metrics-v3/ibm.platform.metrics.us-south.metric_datapoint.v1/timestamp=2019-02-11 00:30/ibm.platform.metrics.us-south.metric_datapoint.v1 3 0004659419.avro
metrics-v3/ibm.platform.metrics.us-south.metric_datapoint.v1/timestamp=2019-02-11 00:30/ibm.platform.metrics.us-south.metric_datapoint.v1 4 0002196820.avro
metrics-v3/ibm.platform.metrics.us-south.metric_datapoint.v1/timestamp=2019-02-11 00:30/ibm.platform.metrics.us-south.metric_datapoint.v1 5 0001057542.avro
metrics-v3/ibm.platform.metrics.us-south.metric_datapoint.v1/timestamp=2019-02-11 00:30/ibm.platform.metrics.us-south.metric_datapoint.v1 6 0001986412.avro
metrics-v3/ibm.platform.metrics.us-south.metric_datapoint.v1/timestamp=2019-02-11 00:30/ibm.platform.metrics.us-south.metric_datapoint.v1 7 0002156080.avro

@damache @gilv I can try to replicate from the bucket you have if you want to PM me credentials over Slack

@damache @gilv which version of Stocator & COS SDK is being used for this? I see within stocator-process-initial-request.log that it is using AWS packages to handle the XML response, which would suggest the COS SDK is not being used.

1.0.28. I just looked on Maven and there are two newer versions. I can change to a new version and run it now.

I ran it with 1.0.30 and still get an FNF (FileNotFound) exception:

19/02/15 06:13:31 DEBUG COSInputStream: Stream managed-apps-drop-zones/metrics-v3/ibm.platform.metrics.us-south.metric_datapoint.v1/timestamp=2019-01-29 22:15/ibm.platform.metrics.us-south.metric_datapoint.v1+3+0004028947.avro aborted: seekInStream(); remaining=72286 streamPos=8196, nextReadPos=0, request range 0-80482 length=80482
19/02/15 06:13:31 DEBUG COSInputStream: reopen(managed-apps-drop-zones/metrics-v3/ibm.platform.metrics.us-south.metric_datapoint.v1/timestamp=2019-01-29 22:15/ibm.platform.metrics.us-south.metric_datapoint.v1+3+0004028947.avro) for read from new offset range[0-80482], length=16, streamPosition=0, nextReadPosition=0
19/02/15 06:13:31 DEBUG COSAPIClient: Not found cos://managed-apps-drop-zones.mycos/metrics-v3/ibm.platform.metrics.us-south.metric_datapoint.v1/timestamp=2019-02-05 21:45/ibm.platform.metrics.us-south.metric_datapoint.v1 3 0004427459.avro. Throw FNF exception

Thanks @damache. @gilv you will need to provide some input here as to which underlying SDK is in use. The issue is raised against the COS SDK. I've checked the pom on Stocator master and it is using AWS libraries. The log attached to this issue would suggest that also.

As for the version of the COS SDK, Maven doesn't print it on the console. Is it packaged with Stocator? In my local .m2 I have ibm-cos-java-sdk-bundle 2.4.1.

I deleted the COS .m2 cache and reran Spark, and the COS SDK didn't get reloaded. I looked at the Stocator pom file and I don't see COS listed in there. I haven't worked with Java in a few years, so maybe I'm missing something. Ivy doesn't have the jar files either.

From the 1.0.31-SNAPSHOT build I see on GitHub, it is using `<amazon.sdk.version>1.11.59</amazon.sdk.version>`.

@smcgrath-IBM I am not sure why you keep asking about Stocator, while I keep saying that I provided you Java code that uses the COS SDK without any Stocator... Did you see the code I provided you?

@gilv I can't replicate with any objects I've pushed to my storage, or by listing objects from @damache's account. Can you provide me steps to replicate from bucket creation, PutObject & ListObject using the COS SDK? The reason I'm asking about Stocator is that the logs provided are from that application.

@smcgrath-IBM Did you use the code I provided you?

@smcgrath-IBM
I have Java code that uses @damache's account, and it shows the wrong listing.
I provided that code to you.
You claim it works for you? The same exact code??

@gilv I extracted out the listObjects call initially and it wouldn't replicate. Actually running your code relies on Hadoop libraries & configuration files. Please drop in code that I can run using solely the COS SDK, replacing accessKey & secretKey. Then I'll investigate further.

@smcgrath-IBM that code uses a Hadoop Configuration dictionary... just remove it and write a String value with the accessKey instead. Please adapt the code, it's very simple.
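For reference, a minimal standalone setup without the Hadoop Configuration might look like this. This is a sketch only; the endpoint and HMAC keys below are placeholders to replace with real values:

```java
import com.ibm.cloud.objectstorage.auth.AWSStaticCredentialsProvider;
import com.ibm.cloud.objectstorage.auth.BasicAWSCredentials;
import com.ibm.cloud.objectstorage.client.builder.AwsClientBuilder.EndpointConfiguration;
import com.ibm.cloud.objectstorage.services.s3.AmazonS3;
import com.ibm.cloud.objectstorage.services.s3.AmazonS3ClientBuilder;

// placeholder HMAC credentials and endpoint -- replace with real values
String accessKey = "<ACCESS_KEY>";
String secretKey = "<SECRET_KEY>";
String endpoint = "https://s3.us-south.cloud-object-storage.appdomain.cloud";

AmazonS3 s3Client = AmazonS3ClientBuilder.standard()
        .withCredentials(new AWSStaticCredentialsProvider(
                new BasicAWSCredentials(accessKey, secretKey)))
        .withEndpointConfiguration(new EndpointConfiguration(endpoint, "us-south"))
        .build();
```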

@gilv your code is not setting the encoding type when retrieving subsequent listings from the bucket; that is the reason I could not replicate with the listObjects method. You need to set it on the objectList:

```java
if (isTruncated) {
    // re-apply the encoding type before fetching the next page
    objectList.setEncodingType("UTF-8");
    objectList = mClient.listNextBatchOfObjects(objectList);
    objectSummaries = objectList.getObjectSummaries();
} else {
    objectScanContinue = false;
}
```

Shouldn't the original request encoding cascade down into the results it returns?

@smcgrath-IBM why is it not enough to set

`request.withEncodingType("UTF-8");` ?

Also, why did it work in the past?

@gilv there may be a bug on the S3 API side; I can't see it returning the encoding type field in the HTTP XML body, which it should. I'll need to look into that piece further.

@smcgrath-IBM @damache good... then we eventually managed to reproduce the issue, and there is a workaround: set UTF-8 encoding on each listing call, even when it was already provided at the request level.
@damache CODAIT/stocator@687e44d
I modified Stocator and it now returns "+" correctly. Will make a new Stocator release. Will update you once I am done.

A CSAFE-50949 ticket has been created against the S3 API for this issue. I'm closing off this GitHub issue, since it was raised against the Java SDK while the underlying problem has been ticketed against the S3 API.