RFC: Enhanced Interface Specification for Protocol

Question

CDU-Ge opened this issue 7 months ago · comments

Feature name	Interface Specification Enhancement
Author(s)	Ce-CDU (cosplox@outlook.com)
Updated	2023-10-22

Summary

This RFC proposes an enhancement to the current interface specification for task management. The existing specification defines the interfaces, but it does not describe the internal state changes they cause or how authority and responsibility are divided among the participants. This RFC adds those descriptions so that state which is currently implicit in a system becomes explicit, reducing the complexity that implicit state can introduce.

Motivation

The motivation behind this proposal is to remove the ambiguity and fill the gaps in the existing interface specification, which can cause confusion and make systems harder to implement. A clearer specification benefits every kind of user of the protocol, whether a human, another agent, or a machine.

Agent Builders Benefit

The proposed changes will benefit agent builders by offering a more detailed and well-defined interface specification. A clearer statement of who holds which authority and responsibility, and of which internal state changes each interface causes, should result in more robust and predictable system behavior.

Design Proposal

The core of this proposal involves adding detailed descriptions of abstract entities and their interactions. The existing system comprises four key entities, viewed from an abstract perspective:

Task: Represents a task with a specific goal.
Step: Denotes a step within a task.
Action: Corresponds to an action associated with each step.
Artifact: Signifies a persistent resource space owned by an agent.

The proposal suggests that all interfaces should revolve around these entities, covering their creation, retrieval, and listing. The protocol should also describe the ownership relationships between these entities explicitly, stating who owns each one.
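
As a rough illustration of the entities and the create / retrieve / list interfaces described above, here is a minimal in-memory sketch. The `Owner` enum, every field name, and the `TaskStore` class are assumptions made for this example; none of them come from the existing specification.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional
import itertools


class Owner(Enum):
    """Hypothetical set of parties that can own an entity."""
    USER = "user"
    CLIENT = "client"
    SERVER = "server"
    AGENT = "agent"


@dataclass
class Artifact:
    artifact_id: str
    owner: Owner = Owner.AGENT  # the RFC describes artifacts as agent-owned


@dataclass
class Action:
    action_id: str
    step_id: str  # assumption: each Action belongs to one Step
    description: str


@dataclass
class Step:
    step_id: str
    task_id: str  # assumption: each Step belongs to one Task
    actions: List[Action] = field(default_factory=list)


@dataclass
class Task:
    task_id: str
    goal: str
    owner: Owner  # ownership is stated explicitly instead of being implied
    steps: List[Step] = field(default_factory=list)


class TaskStore:
    """In-memory stand-in for the create / retrieve / list interfaces."""

    def __init__(self) -> None:
        self._tasks: Dict[str, Task] = {}
        self._ids = itertools.count(1)

    def create(self, goal: str, owner: Owner) -> Task:
        task = Task(task_id=f"task-{next(self._ids)}", goal=goal, owner=owner)
        self._tasks[task.task_id] = task
        return task

    def get(self, task_id: str) -> Optional[Task]:
        return self._tasks.get(task_id)

    def list_tasks(self, owner: Optional[Owner] = None) -> List[Task]:
        # Filtering by owner is only possible because ownership is a field.
        return [t for t in self._tasks.values() if owner is None or t.owner == owner]
```

The point of the sketch is only that once ownership is a field on the entity, a question such as "which tasks does the user own?" has a definite answer.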

To illustrate the proposed changes, let's take the "Task" entity as an example. In the current system, three main participants are involved:

User: Interacts with the agent through a client and provides, at a minimum, the task's goal.
Client: Facilitates interactions between the user and the server, accepting user input.
Server: Provides the interface for task management.

Three primary scenarios are considered:

User interacts with the client to generate a Task, complete with metadata generated by a large language model (LLM). The Task is then created through the interface.
User interacts with the client and uploads data to the server. The server uses an LLM to generate metadata and create the Task.
User interacts with the client, locally constructs metadata, and uploads it to the server through the protocol. The server uses the user's data to create the Task.

While the protocol's content remains the same across these scenarios, the resulting effects differ, and so does the division of authority and responsibility among the User, Client, and Server. This can lead to confusion. The protocol therefore needs to define the roles and responsibilities of each participant clearly, along with the state changes each interface represents.
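
One way to make the difference between the three scenarios visible is to record, alongside an otherwise identical request, which participant built the metadata. The sketch below is an assumption-laden illustration: the scenario labels, the `Responsibility` type, and the `metadata_built_by` / `metadata_uses_llm` fields are hypothetical and not part of the current protocol.

```python
from dataclasses import dataclass
from enum import Enum


class Party(Enum):
    USER = "user"
    CLIENT = "client"
    SERVER = "server"


@dataclass(frozen=True)
class Responsibility:
    builds_metadata: Party  # who turns the goal into task metadata
    uses_llm: bool          # whether an LLM is involved in building it


# Hypothetical labels for the three scenarios listed above.
SCENARIOS = {
    "client_llm": Responsibility(builds_metadata=Party.CLIENT, uses_llm=True),
    "server_llm": Responsibility(builds_metadata=Party.SERVER, uses_llm=True),
    # Assumption: locally constructed metadata involves no LLM.
    "user_local": Responsibility(builds_metadata=Party.USER, uses_llm=False),
}


def create_task_request(goal: str, metadata: dict, scenario: str) -> dict:
    """Build a request body that states who produced the metadata."""
    resp = SCENARIOS[scenario]  # raises KeyError for an unknown scenario
    return {
        "goal": goal,
        "metadata": metadata,
        "metadata_built_by": resp.builds_metadata.value,
        "metadata_uses_llm": resp.uses_llm,
    }
```

With something like this, two requests carrying the same goal and metadata can still be told apart by the server, which is the distinction the scenarios above turn on.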

The protocol should also be able to describe complex systems at the extreme end, including plugin systems that may be designed in the future.

Detailed Design

This section describes design details and situations the protocol may need to handle. It is provided for reference.

The following figures provide a simplified description of the relationships between User, Client, and Server, omitting some details. Although the figures are incomplete due to technical constraints, they cover the main components.

User, Client, and Server Relationship:

sequenceDiagram
title: user, client and server
participant User
participant Client
participant Server

note left of User: Human, Other Agent or Machine
note right of Server: Remote server, possibly more than one

User -> Client: input target and create Task
Client -> Server: build metadata and create Task
Server --> User: Task Obj: TaskID
User -> Client: Update Task Metadata
Client -> Server: Update Task Metadata
Server -> Client: Updated Task Object
Client -> User: Display Task Metadata
loop Talk
  User --> Client: Run
  Client --> Server: Talk about Step 1(start: 1)
  Server --> Client: Talk about Step 1(start: 1)
  Client -> User: Return Action and Request Run It.(User Feedback)
end
note over User, Server: Many Steps after..., Get RESULT
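
The exchange in the diagram can be replayed as a small simulation. This is a sketch under stated assumptions: `FakeServer`, `run_session`, and the `approve` callback are hypothetical names, the server is a stub that proposes one action per step, and the metadata update round trip is left out for brevity.

```python
from typing import Callable, List, Tuple

Message = Tuple[str, str, str]  # (sender, receiver, label)


class FakeServer:
    """Stand-in server: proposes one action per step, done after `total_steps`."""

    def __init__(self, total_steps: int) -> None:
        self.total_steps = total_steps
        self.step = 0

    def create_task(self, goal: str) -> str:
        return "task-1"

    def next_action(self, task_id: str) -> Tuple[str, bool]:
        self.step += 1
        return f"action for step {self.step}", self.step >= self.total_steps


def run_session(goal: str, server: FakeServer,
                approve: Callable[[str], bool]) -> List[Message]:
    """Replay the exchange and return the message log."""
    log: List[Message] = [("User", "Client", f"create task: {goal}")]
    task_id = server.create_task(goal)
    log.append(("Client", "Server", "build metadata and create task"))
    log.append(("Server", "Client", f"task object: {task_id}"))

    done = False
    while not done:
        log.append(("User", "Client", "run"))
        action, done = server.next_action(task_id)
        log.append(("Server", "Client", action))
        # The client shows the action to the user and asks before running it.
        log.append(("Client", "User", f"request approval: {action}"))
        verdict = "approved" if approve(action) else "rejected"
        log.append(("User", "Client", f"{verdict}: {action}"))
    return log
```

Calling `run_session("summarize", FakeServer(total_steps=2), approve=lambda a: True)` yields a log with two `run` messages, one per step.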

User, Client, and Server Interaction for Agent:

The following diagram represents the most complex scenario envisioned in the complete system. In this scenario, the author designed components in both the Client and Server for handling input and output, with metadata serving as the central element. Metadata is essential for storing state and third-party data. It is crucial to emphasize that plugins are not considered part of the protocol and should not be included in the standard interface.

graph TD
  start[User Creates Task]
  start --> clientInputPlugin[Handle User Input Client Plugin]
  clientInputPlugin --> clientMetadataBuilder[Build Metadata LLM]
  clientMetadataBuilder --> clientOutputPlugin[Handle Client Output Client Plugin]
  clientOutputPlugin --> ClientRequest[Create Client Request]
  ClientRequest --> ServerAPI[Server API]
  ServerAPI --> serverMiddleware[Server Middleware User Control, Security Checks, etc.]
  serverMiddleware --> ServerRouter[Route Server API]
  ServerRouter --> serverInputPlugin[Handle Server Input Server Plugin]
  serverInputPlugin --> serverMetadataBuilder[Build Metadata LLM]
  serverMetadataBuilder --> serverOutputPlugin[Handle Server Output Server Plugin]
  serverOutputPlugin --> database[Database Operations]
  database --> ServerResponse[Generate Server Response]
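
The pipeline above can be sketched as two chains of stages with a narrow wire format between them. Everything here is illustrative: the stage names mirror the diagram, but `compose`, `tag`, `to_wire`, and the `PROTOCOL_FIELDS` tuple are assumptions, and the LLM call is replaced by a placeholder.

```python
from typing import Any, Callable, Dict, List

Request = Dict[str, Any]
Stage = Callable[[Request], Request]


def compose(stages: List[Stage]) -> Stage:
    """Chain stages left to right into a single callable."""
    def run(request: Request) -> Request:
        for stage in stages:
            request = stage(request)
        return request
    return run


def tag(name: str) -> Stage:
    """Hypothetical stage that only records its own name in a trace."""
    def stage(request: Request) -> Request:
        trace = list(request.get("trace", []))
        return {**request, "trace": trace + [name]}
    return stage


def build_metadata(request: Request) -> Request:
    # Placeholder for the LLM call; here the goal is simply copied.
    meta = dict(request.get("metadata", {}))
    meta.setdefault("goal", request["goal"])
    return {**request, "metadata": meta}


PROTOCOL_FIELDS = ("goal", "metadata")  # hypothetical wire fields


def to_wire(request: Request) -> Request:
    """Keep only protocol fields; plugin state never crosses the wire."""
    return {k: request[k] for k in PROTOCOL_FIELDS if k in request}


client_side = compose([tag("client_input_plugin"), build_metadata,
                       tag("client_output_plugin")])
server_side = compose([tag("middleware"), tag("router"),
                       tag("server_input_plugin"), build_metadata,
                       tag("server_output_plugin"), tag("database")])


def handle(goal: str) -> Request:
    wire = to_wire(client_side({"goal": goal}))
    return server_side(wire)
```

The design choice worth noting is `to_wire`: whatever plugins do on either side, only the protocol fields pass between Client and Server, which matches the statement that plugins are not part of the standard interface.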

The "Agent Internal Abstract Entity Relationship Structure" diagram later in this section represents the internal abstract entity relationship structure of the Agent, with a focus on its interaction with the User. It is inspired by the implementation of Auto-GPT and simplifies some elements.

In the current Agent design, the core process is a loop: the LLM generates an Action within a given context, and the system then provides feedback on this Action, which produces an output. The loop continues until a final result is obtained, which is evaluated either by a human or by the LLM.

Within this structure, the key question is how interaction with the User occurs during task execution. Interaction takes place when a Step generates an Action: the User either grants permission to execute it or provides feedback. No additional entities should be introduced, because each one could significantly increase the complexity of the protocol.

In this design, there is a clear separation between the resources available to the system's plug-ins and those available to the Agent. System plug-ins are not involved in the LLM decision-making process, and the resources accessible to the Agent influence the LLM's decision-making outcomes.

The Agent is stateful, but the structure itself is stateless. State changes are driven by the entities operated on through the interface and do not include the existing context. The design of Actions also makes it possible to implement active plug-ins for the Agent, so that Steps or Actions can be executed in different locations: in the User interaction (Agent feedback), on the Client (local system), or on the Server.
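
To show what executing Actions in different locations could look like, here is a minimal dispatch sketch. The `Location` enum, the `Action` fields, and `make_dispatcher` are hypothetical names introduced for this example; the RFC does not define them.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict


class Location(Enum):
    USER = "user"      # executed through User interaction (Agent feedback)
    CLIENT = "client"  # executed on the local system
    SERVER = "server"  # executed on the Server


@dataclass(frozen=True)
class Action:
    name: str
    location: Location
    payload: str


Executor = Callable[[Action], str]


def make_dispatcher(executors: Dict[Location, Executor]) -> Executor:
    """Route each Action to the executor registered for its location."""
    def dispatch(action: Action) -> str:
        try:
            executor = executors[action.location]
        except KeyError:
            raise ValueError(
                f"no executor registered for {action.location.value}"
            ) from None
        return executor(action)
    return dispatch


# Example wiring: this deployment cannot execute Actions on the Server.
dispatch = make_dispatcher({
    Location.USER: lambda a: f"asked user: {a.payload}",
    Location.CLIENT: lambda a: f"ran locally: {a.payload}",
})
```

Because the location is carried on the Action itself, the dispatcher holds no state of its own, in keeping with the stateless structure described above.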

Agent Internal Abstract Entity Relationship Structure:

graph TD
start[Initialize Agent]
UserPrompt[User Prompt for Task Creation]
BuildAgentMetadata[Build Agent Metadata]
TaskResult[Is Task Result Ready?]

CreateInitStep[Create Initial Step Using Client and Server Metadata]
StepGenerateAction[Generate Actions Using LLM]
WaitUserActionInput[Await User Action or Feedback]
ActionEnd[Record Action Result in Database]
End[End of Task]

start --> UserPrompt
UserPrompt --> BuildAgentMetadata
BuildAgentMetadata --> TaskResult

TaskResult -- yes --> End
TaskResult -- no --> CreateInitStep

CreateInitStep --> StepGenerateAction
StepGenerateAction --> WaitUserActionInput
WaitUserActionInput --> ActionEnd
ActionEnd --> TaskResult
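
The loop in this diagram can be written out as a short function. This is a sketch, not the Auto-GPT implementation: `propose`, `review`, and `is_done` are hypothetical callbacks standing in for the LLM, the User, and the result check, and the `max_steps` cap is an added safeguard that does not appear in the diagram.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class ActionRecord:
    step: int
    action: str
    approved: bool
    feedback: Optional[str] = None


@dataclass
class AgentRun:
    goal: str
    # Stands in for the database that records action results.
    history: List[ActionRecord] = field(default_factory=list)


def run_agent(
    goal: str,
    propose: Callable[[AgentRun], str],                    # LLM stand-in
    review: Callable[[str], Tuple[bool, Optional[str]]],   # User stand-in
    is_done: Callable[[AgentRun], bool],                   # "Is Task Result Ready?"
    max_steps: int = 10,
) -> AgentRun:
    run = AgentRun(goal=goal)
    step = 0
    while not is_done(run) and step < max_steps:
        step += 1                            # create the next Step
        action = propose(run)                # generate an Action
        approved, feedback = review(action)  # await user action or feedback
        run.history.append(ActionRecord(step, action, approved, feedback))
    return run
```

A usage example with trivial stand-ins:

```python
run = run_agent(
    goal="collect three facts",
    propose=lambda r: f"search #{len(r.history) + 1}",
    review=lambda a: (True, None),
    is_done=lambda r: sum(rec.approved for rec in r.history) >= 3,
)
```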

This overall system architecture highlights why the protocol design needs to delineate clearly the responsibilities and authority of the User, Client, and Server.

Alternatives Considered

The primary alternative considered is maintaining the existing interface specification without making these enhancements. However, this alternative does not address the issue of clarity and may lead to continued confusion in system implementations.

Note on Plugin Entities

It's important to emphasize that adding plugin entities to the interface is beyond the scope of this proposal. The focus here is on enhancing the clarity of the existing interface specification and addressing issues related to entity ownership and state changes.

Compatibility

The design proposal aims to be backward compatible with existing systems. To roll out this feature, it is important to provide clear documentation and guidelines for implementing the updated interface specification. Compatibility checks with other parts of the system, such as SDK and Client SDK, will also be crucial to ensure a seamless transition.

Questions and Discussion Topics

How should the protocol be updated to clearly define entity ownership?
Are there any potential drawbacks or complexities that need to be addressed in the proposal?
How can the proposed changes benefit agent builders and system implementers?

Sasha Dog · Answer 1 · Tue Oct 24 2023 02:54:53 GMT+0800 (China Standard Time)

Super detailed writeup! Thank you! Taking a look...