AI-Engineer-Foundation / agent-protocol

Common interface for interacting with AI agents. The protocol is tech stack agnostic - you can use it with any framework for building agents.

Home Page: https://agentprotocol.ai


RFC: How do you perform sensitive actions on behalf of the user?

waynehamadi opened this issue · comments

Agent Function Protocol

Feature name: Example name
Author(s): Name (merwane.hamadi@agpt.co)
RFC PR: None
Updated: 2023-08-21

Motivation

People want to send emails with agents. But how do you send emails on behalf of someone?
So far the solution has been to give secrets to the agent. The day we handle more sensitive information, this is going to become a problem.

Imagine you have an agent that gives a secret to another agent to perform a task. At some point you end up with 10 agents reading your secrets. It's just asking for trouble. There is no way anyone will do sensitive actions with an agent (think about paying for something on Amazon, for example).

That's a shame, because these sensitive actions are also the core of the agentic space: if the agent can only toy around with a local file system, then what's the point?

So how do we actually give the agent the ability to do things on my behalf in my Gmail account, LinkedIn account, or Amazon account? (Even my bank account, let's be crazy.)

Agent Builders Benefit

As agent builders, how do we send emails through Gmail, for example? Do we all create a method for that? Then we have to make sure our client knows where to put its API key? And then any time we need a new action (for example, archiving an email), do we write this method all over again?

And now imagine you want to do things in an Outlook account. Do you build it all again there? It might have a different way to authenticate. You pretty much need to build everything in house. And we're all doing this at the moment.

Design Proposal

OK, so instead of performing the action for the client, let's just tell the user what we want to do. In continuous mode the client will do it automatically, without a human in the loop; in manual mode it will ask the user's permission to continue.

So in REST (and obviously I know we want to support more web protocols, such as GraphQL and WebSocket), we can literally just copy OpenAI functions:

POST agent/tasks/{task_id}/steps
BODY
{
  "input": "Hey I want you to grow my fitness business. I am located in the U.S.A. My executive assistant's email is sarah@fit2fat2fit.com."
}
RESPONSE
{
  "output": "Ok, I will send an email to your assistant to ask her to book a strategic call with https://www.acquisition.com/",
  "functions": [
    {
      "name": "send_email_gmail",
      "description": "Send an email through gmail.",
      "parameters": {
        "type": "object",
        "properties": {
          "sender_email": {
            "type": "string",
            "description": "The sender's email"
          },
          "receiver_email": {
            "type": "string",
            "description": "The receiver's email"
          }
        }
      }
    }
  ]
}

And then the client decides whether to perform this sensitive action. This assumes clients that are able to do things. This is an opportunity for us to build a Python or JavaScript client specialized in taking actions, and to make it open source.

We can then pretty much standardize actions.

I know we're going to have a million actions, but it's better than having ten million people all building ten different variants of the same actions for their agents.
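As a sketch of what "the client decides" could look like: a minimal Python client that walks the returned functions array, asks the user for permission in manual mode, and dispatches to locally held implementations. The dispatch table, the "arguments" field, and the body of send_email_gmail are illustrative assumptions, not part of the protocol.

```python
# Hypothetical client-side handler for the "functions" field in a step
# response. send_email_gmail, ACTIONS, and the "arguments" field are
# assumptions for illustration, not defined by the Agent Protocol.

def send_email_gmail(sender_email: str, receiver_email: str) -> str:
    # A real client would call the Gmail API here, using credentials
    # that never leave the client.
    return f"sent from {sender_email} to {receiver_email}"

ACTIONS = {"send_email_gmail": send_email_gmail}

def handle_step(response: dict, continuous: bool = False) -> list:
    """Execute the actions an agent proposed in a step response."""
    results = []
    for fn in response.get("functions", []):
        action = ACTIONS.get(fn["name"])
        if action is None:
            continue  # client does not support this action
        if not continuous:
            # Manual mode: human in the loop approves each action.
            if input(f"Run {fn['name']}? [y/N] ").lower() != "y":
                continue
        results.append(action(**fn.get("arguments", {})))
    return results
```

In continuous mode the prompt is skipped and every supported action runs; the secrets stay inside the client's own action implementations.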

Alternatives Considered

Maybe we can give the secrets to the agent and let it do its thing? We just let each agent creator create and maintain all these actions? I think this is pretty hard to do, and on top of that, if an agent starts holding secrets it could share them with subagents, and then it's a mess.

Compatibility

It's actually backwards compatible.

So far the solution was to give secrets to the agent. The day we have more sensitive information, it's going to become a problem.

Question: what do you consider to be the agent? Is it the entire application, or just the part involving LLMs to perform logic?

Taking Auto-GPT as an example, the secrets are never shared with the Agent. The agent proposes an action, and on execution, any necessary secrets are provided to the action from the application's configuration:

[image]

Question: is this workflow problematic? And why?

I think the answer would be:

  1. YES, this is problematic ...
  2. ... because the Agent Protocol does not provide a standardized, user-friendly way to arrange authentication (or configuration) for actions that need it.

The proposal above is essentially to take the grey blob + authentication in this picture out of the agent application's domain:
[image]

This is an indirect solution for the problem at hand, and this solution introduces its own complications and limitations to developing an agent, so we should explore other solutions first.

Conclusions

  1. The problem at hand is that the protocol lacks a mechanism through which the user can authenticate actions to be taken by the agent.
  2. Moving execution entirely out of the agent's domain introduces limitations and complications to agent development and architecture, and is not the obvious solution to the stated problem.
  3. We have to go deeper into authentication and agents' workflows to find a good solution.

Level -1: Authentication for actions

The crux of the problem is that at some point the user has to authenticate and/or authorize actions on their behalf. The Agent Protocol does not provide a mechanism for this.

Example: OAuth2 could be helpful, but this would require a hosted service which passes the obtained credentials to the application running the agent. This would make it somewhat more complicated for locally running applications.

If we assume that the agent is hosted as a cloud service, this could be a clean solution:

  1. Agent invokes an action
  2. Action requires credentials
  3. Agent application sends authentication request containing an OAuth2 login link to the Client
  4. Client directs User to login page
  5. User completes login flow -> Agent application receives credentials
  6. Agent application executes authenticated action
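The link-building part of step 3 could be sketched roughly as follows. The message type, provider endpoint, client_id, and redirect_uri are all placeholder assumptions; only the query parameters follow the standard OAuth2 authorization-code request.

```python
# Sketch of step 3: the agent application builds an OAuth2 authorization
# link to send to the Client. "auth_request" is a hypothetical protocol
# message type; endpoint and client details are placeholders.
import secrets
from urllib.parse import urlencode

def build_auth_request(provider_auth_url: str, client_id: str,
                       redirect_uri: str, scopes: list[str]) -> dict:
    """Return a message asking the client to direct the user to log in."""
    state = secrets.token_urlsafe(16)  # CSRF protection, checked on callback
    params = {
        "response_type": "code",       # OAuth2 authorization-code flow
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "scope": " ".join(scopes),
        "state": state,
    }
    return {
        "type": "auth_request",
        "login_url": f"{provider_auth_url}?{urlencode(params)}",
        "state": state,
    }
```

On the callback (step 5), the agent application would verify the returned state value and exchange the authorization code for tokens before executing the action (step 6).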

Level -2: Where to execute

Let's consider pulling execution of actions out of the domain of the agent, as illustrated above.

I don't think there is a one-size-fits-all solution here, because of the variety of possible actions:

  1. Web-based or virtualized
    • can be outsourced to an executor outside of the agent
      • can be outsourced to a remote service
      • can be outsourced to a local service
    • can also be executed locally
    • can be packaged and re-used/redistributed
  2. Local, e.g. editing local files or interacting with programs
    • must be executed locally
    • can be outsourced to a local executor outside of the agent
    • can be packaged and re-used/redistributed
  3. Internal, e.g. altering internal state
    • must be executed by/on the agent itself
    • specific to the agent -> can NOT be packaged and redistributed

Additional considerations:

  • Some agents may be able to function without category 3 and/or category 2 actions.

  • Category 3 actions (internal) do not have to be considered for standardization since they are limited to the internal process of the agent.

  • Outsourcing execution to a remote service would support only category 1 actions.

  • Outsourcing execution to a local service would support both category 1 and category 2 actions.

  • Doing everything locally complicates the use of established authentication mechanisms such as OAuth2. Right? (I'm not an expert on this)

commented

A few thoughts:

What if we supported the three ideas outlined (as I read them):
1 - Optional capability sharing
2 - Optionally relying on the client to fulfill (and authenticate) specific tasks
3 - Optionally enabling delegated authentication via well-established auth protocols

A - Sharing capabilities (aka. where to execute)

Authentication aside, it would be highly valuable for the protocol to allow clients to indicate their ability and willingness to handle specific tasks. Clients may even want to demand that they perform particular jobs themselves (CAN vs. PREFER vs. DEMAND).

Borrowing from HTTP: Accept headers, client fingerprinting, etc. enable a lot of helpful user functionality.

Additionally, allowing servers to advertise their capabilities similarly is low-hanging fruit that would expand the protocol in exciting ways, especially in complex systems with multiple agents and skills, where the line between the server (agent), client, and skill gets blurred.

Real-world agents are servers and clients, and one can imagine a chain of agents with the LLM at the end (which some people suggest is also a client with several LLMs behind it). Similarly, skills might be local or remote, and if remote, who is to say there's no agent on the other end?

Crucially, these "advertisements" should be optional, and a server or client doesn't have to support them (except in the "I DEMAND" scenario)

I think this answers the "Where to execute" question: the client can decide to execute if it has the capability (directly, or via plugins or orchestration), but by default it's the server that decides how to fulfill the task with its available skills, whether those skills are local, via some plugin mechanism, via orchestration, or via third-party services.

B - Functions vs. HTTP - not crucial but rather good food for thought.

TLDR; What if the protocol introduced X-Client-Capabilities and X-Server-Capabilities headers?

HTTP supports a lot of the functionality that OpenAI functions exhibit, including bi-directional capability exchange via headers such as the HTTP Accept headers, and crucially, it supports custom headers, which are often used for this purpose (e.g. X-My-Custom-Header)

  • Accept header (and others): clients use this to let servers know what content types they support
  • Accept-CH (and others) - servers can use these to ask the client to share Client Hints which reveal client capabilities, like user-agent info (device type, memory, etc.) or even bandwidth (Downlink client hint)
  • Accept-Language, Accept-Encoding, etc.

Then we've got:

  • Access-Control-Allow-Methods: List of supported methods
  • Access-Control-Allow-Headers: List of supported headers (including custom ones)
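A capability header exchange along these lines could be sketched as follows. The header name X-Client-Capabilities and the `;stance=` syntax are assumptions for illustration, not part of the Agent Protocol or any HTTP standard.

```python
# Hypothetical encoding of the CAN/PREFER/DEMAND idea into a custom
# capability header. Header name and value syntax are assumptions.

def encode_capabilities(caps: dict[str, str]) -> str:
    """Serialize {action: stance} into a header value, e.g.
    'send_email_gmail;stance=DEMAND, browse_web;stance=CAN'."""
    return ", ".join(f"{name};stance={stance}" for name, stance in caps.items())

def decode_capabilities(value: str) -> dict[str, str]:
    """Parse a capability header value back into {action: stance}."""
    caps = {}
    for item in value.split(","):
        name, _, stance = item.strip().partition(";stance=")
        caps[name] = stance
    return caps

headers = {
    "X-Client-Capabilities": encode_capabilities(
        {"send_email_gmail": "DEMAND", "browse_web": "CAN"}
    )
}
```

A server could decode this header on each request and route execution accordingly, falling back to its own skills for any action the client merely CANs.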

C - Delegated Authentication/Authorization (OAuth2, SAML, etc. - Can also work for local agents)

Separately - OAuth2 and SAML are proven technologies for hiding credentials from applications, with lots of drop-in implementations, and can delegate authorization in local scenarios.

The workflow would be much like GitHub Desktop, the gh command line client, or Google's command line tool in local scenarios.

  1. The upstream service (which may be an agent, skill, or LLM) says "I need credentials, and I support OAuth2 (or SAML)"
  2. The agent proxies this to the downstream service (which may be an agent or client)
  3. The request is proxied all the way to the client, which either launches a browser or shows the user the URL and instructions.
  4. The user follows the instructions, and the temp auth credentials (tokens or otherwise) are sent up the chain.
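The proxying in steps 2-4 could be sketched minimally like this; the callable interface for intermediaries and the client is an assumption for illustration.

```python
# Minimal sketch of proxying an auth request down the chain: each
# intermediary forwards the message unchanged, and the client at the
# end runs the login flow and returns temporary credentials.

def proxy_auth_request(intermediaries, client, auth_request: dict) -> dict:
    """Forward an upstream auth request through every intermediary to the
    client, and return the credentials the client obtained from the user."""
    message = dict(auth_request)
    for forward in intermediaries:
        # Intermediaries act as transparent proxies, not auth providers.
        message = forward(message)
    return client(message)
```

This mirrors the point below: intermediaries only relay the authentication protocol; only the service that actually owns the resource acts as the OAuth2/SAML provider.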

IMO, intermediaries should support the authentication protocol as proxies between the upstream services and the clients. They should only try to be OAuth2 providers (or SAML providers) if they provide the service in question.

Of course, if an upstream server doesn't support OAuth2 or SAML, an intermediary could act as an authentication server, but it'll have to contend with gaining the user's trust.