box / box-windows-sdk-v2

Windows SDK for v2 of the Box API. The SDK is built upon .NET Framework 4.5

Home Page: https://developer.box.com

Null Reference Error with BoxJWTAuth.GetToken()

danielledetennis opened this issue

  • [✔️] I have checked that the [SDK documentation][sdk-docs] doesn't solve my issue.
  • [✔️] I have checked that the [API documentation][api-docs] doesn't solve my issue.
  • [✔️] I have searched the [Box Developer Forums][dev-forums] and my issue isn't already reported (or if it has been reported, I have attached a link to it, for reference).
  • [✔️] I have searched [Issues in this repo][github-repo] and my issue isn't already reported.

Description of the Issue

I have also submitted a ticket to Box Support with this issue, Case #2235661
Background:
We were not experiencing this issue until August 20, 2020. After that point, authentication fails for about 1/3 of our users. The failure is not consistent, so one user will have the error for several hours or days, and then be able to use the same API calls without any error. This error does not occur with version 3.21 and earlier, so I believe something in pull #631 is causing this issue along with whatever changed with the authentication around August 20th.

Error Description:
Box.V2.JWTAuth.BoxJWTAuth.AdminToken() sometimes throws a null reference exception. It does not fail for everyone, but once it starts it keeps failing for that specific user for several hours or days.

Steps to Reproduce

Here is the full code that is causing the error. The config.json file was generated by our Box app and we've been using it for over a year without issue. The error occurs on the last line.

FileStream fs = new FileStream("config.json", FileMode.Open, FileAccess.Read, FileShare.Read);
IBoxConfig config = BoxConfig.CreateFromJsonFile(fs);
BoxJWTAuth auth = new BoxJWTAuth(config);
return auth.AdminToken();

Expected Behavior

Return a valid authentication token.

Error Message, Including Stack Trace

Object reference not set to an instance of an object
at Box.V2.JWTAuth.BoxJWTAuth.GetToken(String subType, String subId)
at Box.V2.JWTAuth.BoxJWTAuth.AdminToken()

Versions Used

3.21: No problems
3.22 - 3.24: Error described above.

Thanks!

Hi @danielledetennis ,

Thanks for submitting this Issue! We will take a look and get back to you ASAP!

@PJSimon

@danielledetennis Could you post the debug lines printed here https://github.com/box/box-windows-sdk-v2/pull/631/files#diff-160823e6cd213feaa028eb93ca2d9223R212, error logs or the request_id for the failed requests?

The code fails at line 174, before it can get to a debug line with a "badrequest" status code. Let me know what other information would be useful.
[Screenshot]

I just wanted to add a little more information in case it is useful: The screenshot above is from a PC with a static IP. It has been unable to authenticate with Box since August 20th, as shown in logs from an application that runs nightly on that PC. All of our other PCs have dynamic IPs that change every time we restart or log off. I was trying to test the debug lines with my PC today, but I couldn't get it to fail, even though it was consistently failing for me on Wednesday.

Thanks, @danielledetennis!
Could you send us the values of ex.Error.Code and ex.Error.Description?
Better yet, can you output ex.ToString() and share it here? Please be sure to redact any account-specific or otherwise sensitive information.

Also, do you know if GetToken() is being called from AdminToken() or UserToken()?

Thanks!

GetToken() is being called from AdminToken().

"Unfortunately" the computer that was unable to authenticate is suddenly not having any issues. I will try to get it to fail again tomorrow.

I've tried it out on half a dozen PCs this morning, but I can't get the authentication to fail on any of them. It seems to have fixed itself?

I have an application that runs daily and takes advantage of some of the more recent changes to the SDK. User-run applications can stay on 3.21 for now, and I'll keep watching the logs of the daily application on 3.24 to see if it fails again.

I have also attached a test to one of the user applications that attempts an authentication with 3.24 when the user launches the application. It logs the timestamp, computer, and IP address along with any errors. That will give me a larger chance to catch the error.

@danielledetennis Thanks for all the info.

My first thought was that the JWT assertion is being created with a bad DateTime, caused by the clock being off compared to the time on Box servers. If that were the case, the if statement on line 174 in your screenshot would have evaluated to TRUE, and the JWT assertion would have been recreated and the request re-sent, because all three conditions would be true:
(ex.StatusCode == HttpStatusCode.BadRequest && ex.Error.Code.Contains("invalid_grant") && ex.Error.Description.Contains("exp"))

I just wanted to double-check that you were saying in your comment that on line 174, the whole condition actually evaluates to FALSE and the next line of execution is the else statement, which just throws the exception, right? So there's no retrying happening at all here, right? In that case, then yes, we will need the full exception details to see the reason for the "Bad Request."

The timestamp and IP address you provide will help us with potentially looking at logs on Box servers. The request_id or visitor_id will also be very helpful.

I'll leave this issue open so we can check back with you if we don't hear from you in a few days.

Yes, I wish I had looked at it a little closer or left it up in debug mode over the weekend. I had assumed the BadRequest status just wasn't expected in the switch. The whole statement evaluates to false, so it throws the exception next, skipping the debug statement and retry.

Where would I find the request_id or visitor_id?

I think this issue is the same as #645.
I wrote reproduction steps for it.

Thanks @mkiosugadaikichi, I think you're right!

I followed your steps and was able to throw the same error. It fails because ex.Error.Code is null.

I think the main difficulty is that whatever is causing this issue is so inconsistent. I had seen that issue and thought it was similar, but since I didn't experience it at that time I assumed it was a different issue. However, the user application that started throwing this error on August 20 was developed in early June, so I missed the period during which #645 was causing an error. I did have two applications running on a schedule on a computer at that time, but by chance that computer might not have had the error, since only about 1/3 of users got the error at any time when it was occurring. All other applications were likely running 3.21 or earlier.

Since Monday, I've logged around 75 tests from different users authenticating with our Box app. There have been no errors.

This issue started up again today after 2 weeks with no issues. 3 out of 9 connections tested came back with an error. Where can I send you the timestamps and IP addresses privately?

My co-worker also experienced this issue from the Java SDK. From there it threw a more descriptive error message:

com.box.sdk.BoxAPIResponseException: The API returned an error code [400] invalid_grant - Current date/time MUST be before the expiration date/time listed in the 'exp' claim
The API returned an error code [400] invalid_grant - Current date/time MUST be before the expiration date/time listed in the 'exp' claim

Hi @danielledetennis ,

Sorry for the delay. You can send the timestamps and IPs to me here: [my first name]@box.com Could you also include the values of ex.Error.Code and ex.Error.Description?
Better yet, can you output ex.ToString() and include that, too? Thanks!

In the meantime, I'll take a look at how #645 might be related. You said that ex.Error.Code is null when you reproduced #645, but we still haven't determined if ex.Error.Code is null for your issue, right?

It sounds like the time setting on the computer sending the requests via the Java SDK is out of sync with (that is, behind) the Box server. Have you been able to verify that this is the error message you are getting in the .NET SDK, as well?

Are you using the retry logic in the SDK? As long as the server time is returned in the response, and I think it should be, the SDK should reconstruct the JWT assertion using an expiration time based on the actual server time.

I'm looking forward to digging into this further with you this week.

Thanks,

Patrick

Hi @danielledetennis,

I haven't seen an email come through, so I apologize if I've missed it. Also, since I posted my email address here, I suddenly started getting tons of spam so I edited it to say that my email address is just my first name, and then "@box.com". It's also listed in my GitHub profile.

Do you have any update from your end? Standing by...

Thanks,

Patrick

Thanks Patrick, I'll send an email in the next few minutes. I just got back from a vacation, so I'm catching up on some email.

On our end, I caught a discrepancy in the time. We use two NTP servers, and I was able to see my clock alternate between 30 seconds behind and on time about a dozen times in one morning. After a week of errors, it again seems to have resolved itself over the weekend. I have talked to our IT, and they've asked if there can be a larger tolerance for the time. There's more we can do to get our clocks synced, but it's difficult for us to keep the discrepancy below 30 seconds at all times, especially if there's a difference between our server and yours even when our server is perfectly synced.
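The clock discrepancy described above can be checked in code. The following is a minimal sketch, not part of the SDK: `ClockSkew`, `Measure`, and `IsWithin` are hypothetical names, and the reference time is assumed to come from some trusted source such as an NTP query or an HTTP response header.

```csharp
using System;

// Hypothetical helper, not part of the Box SDK.
public static class ClockSkew
{
    // Positive result: the local clock is ahead of the reference clock.
    // Negative result: the local clock is behind it.
    public static TimeSpan Measure(DateTimeOffset localNow, DateTimeOffset referenceNow)
    {
        return localNow - referenceNow;
    }

    // True when the absolute skew does not exceed the given tolerance.
    public static bool IsWithin(TimeSpan skew, TimeSpan tolerance)
    {
        return skew.Duration() <= tolerance;
    }
}
```

A client whose clock is behind by more than the lifetime of its JWT assertion would produce an 'exp' value that is already in the past from the server's point of view, which would match the 'exp' error message quoted earlier in this thread.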

ex.Error.Code is null, so line 186 of BoxJWTAuth.cs throws the null reference error for ex.Error.Code.Contains("invalid_grant"), and this is where our problem is.

I caught one of our test machines failing my test today, and sure enough, its clock was off by 46 seconds. I have Visual Studio installed on this computer, so I ran the code in debug mode to look at it more closely.

[Screenshot]

The error is caught by the error handler in line 165 as expected.

[Screenshot]

But ex.Error.Code is null, so trying to evaluate ex.Error.Code.Contains() triggers the null reference error I am seeing on line 186.

Here is the string value of the exception before line 186 turns it into a null reference error:

Box.V2.Exceptions.BoxException: The API returned an error [BadRequest] invalid_grant - Current date/time MUST be before the expiration date/time listed in the 'exp' claim
at Box.V2.Extensions.BoxResponseExtensions.ParseResults[T](IBoxResponse`1 response, IBoxConverter converter) in Box.V2\Extensions\BoxResponseExtensions.cs:line 81
at Box.V2.JWTAuth.BoxJWTAuth.JWTAuthPost(String assertion) in Box.V2\JWTAuth\BoxJWTAuth.cs:line 277
at Box.V2.JWTAuth.BoxJWTAuth.GetToken(String subType, String subId) in Box.V2\JWTAuth\BoxJWTAuth.cs:line 162
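The failure mode described above, calling Contains() on a null ex.Error.Code, can be avoided with C#'s null-conditional operator. This is a minimal sketch and not the fix that was shipped in the SDK: `ApiError`, `ApiException`, and `ClockSkewCheck` are hypothetical stand-ins for the SDK's types, used so the example is self-contained.

```csharp
using System;
using System.Net;

// Hypothetical stand-ins for the SDK's error and exception types.
public class ApiError
{
    public string Code { get; set; }
    public string Description { get; set; }
}

public class ApiException : Exception
{
    public HttpStatusCode StatusCode { get; set; }
    public ApiError Error { get; set; }
}

public static class ClockSkewCheck
{
    // Same three-part condition quoted earlier in the thread, written so that
    // a null Error, Code, or Description makes the check false instead of
    // throwing. The ?. operator yields null when its left side is null, and
    // comparing the resulting bool? with "== true" maps null to false.
    public static bool IsExpClaimRejection(ApiException ex)
    {
        return ex != null
            && ex.StatusCode == HttpStatusCode.BadRequest
            && ex.Error?.Code?.Contains("invalid_grant") == true
            && ex.Error?.Description?.Contains("exp") == true;
    }
}
```

With this shape, an exception whose Code is null falls through to the caller's else branch with the original exception intact. Whether a retry should still happen in that case is a separate design decision that this sketch does not make.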

Thanks for the report @danielledetennis!

I don't fully understand why ex.Error.Code is null yet, but it seems like it might be related to the changes made in version 3.22, since that version included changes in this code. I'll work on reproducing the error response you're getting by intentionally setting delayed time values in my request. There will likely be a fix needed and it would go out with our next release, which should be in the early part of our next sprint (10/05 - 10/19).

This fix should catch the failure you're getting and automatically retry the request. Before each retry, the SDK will attempt to use the time from the Box server (in the header of the response) to create a new request, instead of your client's time. This should get around your time discrepancy issue, but it means that whenever the client's clock is off, every request will essentially be sent twice: once with the delayed local time and then again with the correct time. This could affect your rate limit, depending on how often your app authenticates.
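The retry idea described above, basing the new expiration on the server's clock rather than the client's, can be sketched as follows. This is an illustration under assumptions, not the SDK's implementation: `ServerClock` and its methods are hypothetical names, the server time is assumed to arrive as a standard HTTP Date header in RFC 1123 format, and the assertion lifetime is left as a parameter because the thread does not state it. `ToUnixTimeSeconds` requires .NET Framework 4.6 or later.

```csharp
using System;
using System.Globalization;

// Hypothetical helper, not part of the Box SDK.
public static class ServerClock
{
    // Parses an HTTP Date header value in RFC 1123 format,
    // for example "Tue, 15 Sep 2020 12:00:46 GMT".
    public static DateTimeOffset ParseDateHeader(string headerValue)
    {
        return DateTimeOffset.ParseExact(
            headerValue,
            "r",
            CultureInfo.InvariantCulture,
            DateTimeStyles.AssumeUniversal);
    }

    // Computes an 'exp' claim value (seconds since the Unix epoch)
    // from the server's time instead of the local clock.
    public static long ExpFromServerTime(DateTimeOffset serverNow, TimeSpan lifetime)
    {
        return (serverNow + lifetime).ToUnixTimeSeconds();
    }
}
```

On a retry, the caller would pass the Date header from the failed response to ParseDateHeader and use the result to build the new assertion, so that the local clock's offset no longer affects the 'exp' value.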

So, in the meantime, you should definitely try to resolve your time discrepancy issue. It's always a good thing to know what time it is (unless you're the band Chicago). :^) For security reasons, it's a best practice to keep the expiration of the JWT assertion quite short, so we can't really relax that in the SDK and there's no config option for it.

You mentioned that you are using two NTP servers. Since the issue is intermittent, maybe one is correct and the other isn't?

I'll start work on the PR for this today and I'll link it to this issue so you can track.

Thanks!

Patrick

Thanks.

Our app authenticates only the first time each user opens it, so the rate limit shouldn't be too much of an issue. I am continuing to work with our IT to fix the issue regardless. I also thought it may have been caused by one of the two NTP servers, but apparently we only started using two servers after the first time the issue occurred and resolved itself (as an attempt to fix it). So either the issue only just started with our original NTP or it's another problem entirely.

@danielledetennis I was able to fix the retry logic for authentication requests that return a "clock skew" exception. It should resolve this issue, as well as #645. It will go out with our next release, which should be in the early part of our next sprint (10/05 - 10/19).

Great, thanks!

Closing this issue as the fix for this issue (#697) will be in the next release, which should deploy in the next week or two.

@danielledetennis this has been released!
Thanks for your patience!