cognitect-labs / aws-api

AWS, data driven

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Creds fail to load (from env) when Java11 snap-start enabled

stevebuik opened this issue · comments

Thank you for your interest in helping to improve Cognitect's aws-api!

Dependencies

Be sure to list the precise libs and versions you are using ("the
latest" might change by the time we're looking at your issue).

e.g.

{:deps {com.cognitect.aws/api                      {:mvn/version "0.8.630"}
           com.cognitect.aws/endpoints                {:mvn/version "1.1.12.358"}
           com.cognitect.aws/lambda                   {:mvn/version "825.2.1263.0"}
           com.cognitect.aws/apigatewaymanagementapi  {:mvn/version "821.2.1107.0"}
.. others elided
}}

Description with failing test case

(defn send-message!
  [{:keys [domain-name stage message connection-id]}]
  (println (sort (keys (System/getenv))))
  (let [client (-> {:api                  :apigatewaymanagementapi
                    :endpoint-override    {:hostname domain-name
                                           :path     (format "/%s/@connections/" stage)}
                    :credentials-provider (reify CredentialsProvider
                                            (fetch [_]
                                              {:aws/access-key-id     (u/getenv "AWS_ACCESS_KEY_ID")
                                               :aws/secret-access-key (u/getenv "AWS_SECRET_ACCESS_KEY")
                                               :aws/session-token     (u/getenv "AWS_CONTAINER_AUTHORIZATION_TOKEN")}))}
                   (aws/client))]
    ;(aws/validate-requests client)
    (->> {:op      :PostToConnection
          :request {:ConnectionId connection-id
                    :Data         (.getBytes message)}}
         (aws/invoke client))))

Stack traces

{:message "The security token included in the request is invalid.", :cognitect.anomalies/category :cognitect.anomalies/forbidden}

works fine when snapstart is disabled (albeit with slow cold start) and AWS_SESSION_TOKEN used instead

further tests using container creds led to this variation of the fn

(defn send-message!
  [{:keys [domain-name stage message connection-id]}]

  (println
    "creds response"
    (u/getenv ec2/container-credentials-full-uri-env-var)
    (pr-str
      (a/<!!
        (http/submit
          (cognitect.http-client/create {})
          (#'mu/request-map (URI. (u/getenv ec2/container-credentials-full-uri-env-var)))))))

  (let [client (-> {:api                  :apigatewaymanagementapi
                    :endpoint-override    {:hostname domain-name
                                           :path     (format "/%s/@connections/" stage)}
                    :credentials-provider (credentials/container-credentials-provider (shared/http-client))}
                   (aws/client))]
    ;(aws/validate-requests client)
    (->> {:op      :PostToConnection
          :request {:ConnectionId connection-id
                    :Data         (.getBytes message)}}
         (aws/invoke client))))

which generates this log response...

creds response http://127.0.0.1:9001/2021-04-23/credentials {:status 404, :headers {"x-content-type-options" "nosniff", "content-length" "46", "date" "Thu, 15 Dec 2022 03:51:06 GMT", "content-type" "text/plain; charset=UTF-8"}, :body #object[java.nio.HeapByteBuffer 0x3d526ad9 "java.nio.HeapByteBuffer[pos=0 lim=46 cap=46]"]}

which I presume means that the ec2 credentials request process does not work in the lambda snapstart environment

this works!!

(defn snapstart-credentials-provider
  "return a creds provider that uses the 2 Snapstart environment variables to get current keys/tokens for API requests"
  [http-client]
  (reify CredentialsProvider
    (fetch [_]
      (let [request (-> (u/getenv mu/container-credentials-full-uri-env-var)
                        (URI.)
                        (#'mu/request-map)                  ; TODO has https bug but doesn't affect this provider which uses http/80
                        (assoc-in [:headers "Authorization"] (u/getenv "AWS_CONTAINER_AUTHORIZATION_TOKEN")))
            response (->> request
                          (http/submit http-client)
                          (a/<!!))]
        (when (= 200 (:status response))
          (let [creds (some-> (:body response) (u/bbuf->str) (json/read-str :key-fn keyword))]
            (credentials/valid-credentials
              {:aws/access-key-id     (:AccessKeyId creds)
               :aws/secret-access-key (:SecretAccessKey creds)
               :aws/session-token     (:Token creds)
               ::credentials/ttl      (credentials/calculate-ttl creds)}
              "snapstart container")))))))

(defn send-message!
  [{:keys [domain-name stage message connection-id]}]
  (let [client (-> {:api                  :apigatewaymanagementapi
                    :endpoint-override    {:hostname domain-name
                                           :path     (format "/%s/@connections/" stage)}
                    :credentials-provider (snapstart-credentials-provider (http/create {}))}
                   (aws/client))]
    ;(aws/validate-requests client)
    (->> {:op      :PostToConnection
          :request {:ConnectionId connection-id
                    :Data         (.getBytes message)}}
         (aws/invoke client))))

with that impl, the cold start first request is a bit slow...
REPORT RequestId: 53dd2192-1c24-456d-b226-f3e675810ced Duration: 7296.40 ms Billed Duration: 7458 ms Memory Size: 1024 MB Max Memory Used: 174 MB Restore Duration: 372.46 ms

but subsequent requests are as quick as you would expect...
REPORT RequestId: b6cabdbe-d556-4b89-9313-6ebfc609814c Duration: 187.31 ms Billed Duration: 188 ms Memory Size: 1024 MB Max Memory Used: 175 MB

not sure why the slow first request but it does work. I'll move forward with this until an official release (that might improve cold start perf as well?)

ok, that slow first request has nothing to do with creds.

it's due to the :PostToConnection call which takes 5-7 secs on first call and 200ms in subsequent calls. I'd guess this is some kind of keep-alive in the client (can we control this?) or aws network init delay (which we can't control)

Either way, this looks like a good creds solution. just needs the official port of this impl. let me know if you need any other info

Thanks for digging this up. I think we can fix this in the ec2-metadata-utils (add the header there if the env var is present). That's what the java sdk v2 is doing. I'm surprised this hasn't come up until now!

I pushed a speculative fix to the dc/container-credentials-auth-header branch. It might be a few days before I'll have time to test this properly, so please feel free to build and test it locally.

Thanks. I'll try it today. I have a clarification question. With my solution, I had trouble making it work with (shared/http-client) so that's why I used (http/create {})

there are 2 http client protocols available in this project:
cognitect.http-client/IClient (submittable)
cognitect.aws.http/HttpClient (stoppable)

The container creds fn needs the submittable type. When it's part of the chain it appears to be passed the stoppable. Is this a bug or am I thinking about this wrong?

I ask because I've experienced the client memory leak in Ions in the past so I want to make sure I use this correctly (even though Lambdas are less prone to leaks i.e. cattle not pets)

cognitect.http-client/IClient is part of a dependency, not part of aws-api.

cognitect.aws.http/HttpClient has -submit and -stop, so it's both submittable and stoppable. Structurally, the cognitect http-client is always wrapped in a reification of cognitect.aws.http/HttpClient, created by calling cognitect.aws.http.cognitect/create.

So either gives you instances of the same thing, though (shared/http-client) will always return the same object, so it's possible that that's the source of any trouble you saw since the aws-client's http-client is waiting for the credential provider's http-client to return before it can complete its work. That shouldn't be an issue, but bugs happen, so let me know if you still see the same problem.

thanks for clarifying. I just tested your fix. don't fully understand the error but using

:credentials-provider (ec2/container-credentials (shared/http-client))

produces...

:message "No implementation of method: :fetch of protocol: #'cognitect.aws.credentials/CredentialsProvider found for class: clojure.lang.PersistentArrayMap"

when I switch back to my custom provider using the same commit, it works fine.

Since my feedback loop via CI is slow, I'm gonna pause testing your code because I need to focus on my deliverable. So I'll wait till you have time to setup a test harness - that'll be more effiecient for both of us.

Happy to provide any other info in the meantime or when you are setting up.

No hurry as I have a working solution till then.

Actually, with my fix you shouldn't need to provide a provider at all. If you have time, give that a shot. If not, it can wait.

That was going to be my last test :) I just ran it and it works! I'll stick with your commit now and will report any other issues that arise.

this is now working for multiple lambdas and multiple AWS apis. I would suggest this can be closed but....

I did uncover another bug that is related to this. It's that the TTL on the container creds doesn't appear to be refreshing after a snapstart resume. I'll attach my ADR doc to this comment. It contains details on the problem and my (verified) workaround.

Since I have a workaround, it's not urgent, but I figure you want to know if a TTL isn't working properly i.e. I suspect this is broken outside of the snapstart use-case as well.
adr-0002-cold-starts.md