Call GPT: Generative AI Phone Calling

Generative AI is producing a bunch of fun new models for us devs to poke at. Did you know you can use these over the phone?

Twilio gives you a superpower called Media Streams. Media Streams provides a WebSocket connection to both sides of a phone call. You can get audio streamed to you, process it, and send audio back.

This repo serves as a demo exploring three models:

Deepgram for Speech to Text
ElevenLabs for Text to Speech
OpenAI for GPT prompt completion

These services combine to create a voice application that is remarkably better at transcribing, understanding, and speaking than traditional IVR systems.

Features:

Returns responses with low latency, typically about 1 second, by using streaming.
Allows the user to interrupt the GPT assistant and ask a different question.
Maintains chat history with GPT.

Setting up for Development

Use ngrok to open a tunnel that exposes port 3000:

ngrok http 3000

Copy .env.example to .env and add all API keys.

Set SERVER to your tunneled ngrok URL

Install the necessary packages:

npm install

Start the web server:

npm run dev

Wire up your Twilio number using the console or CLI:

twilio phone-numbers:update +1[your-twilio-number] --voice-url=https://your-server.ngrok.io/incoming

There is a Stream TwiML verb that will connect a stream to your websocket server.
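
As a rough sketch of what the incoming-call response could look like, the snippet below builds a TwiML document that uses the Connect and Stream verbs to open a two-way stream. The helper name buildStreamTwiml and the /connection WebSocket path are assumptions made for illustration; check the repo's own route for the actual path.

```javascript
// Builds the TwiML returned to Twilio when a call comes in.
// NOTE: buildStreamTwiml and the "/connection" path are assumptions for this sketch.
function buildStreamTwiml(serverHost) {
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<Response>',
    '  <Connect>',
    `    <Stream url="wss://${serverHost}/connection" />`,
    '  </Connect>',
    '</Response>',
  ].join('\n');
}

// Print the TwiML for the example ngrok host used above.
console.log(buildStreamTwiml('your-server.ngrok.io'));
```

Wrapping Stream in Connect is what allows audio to flow in both directions, which this app needs in order to speak back to the caller.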

Application Workflow

CallGPT coordinates the data flow between multiple services, including Deepgram, OpenAI, ElevenLabs, and Twilio Media Streams.

Modifying the ChatGPT Context & Prompt

Within gpt-service.js you'll find the settings for the GPT's initial context and prompt. For example:

this.userContext = [
      { "role": "system", "content": "You are an outbound sales representative selling Apple Airpods. You have a youthful and cheery personality. Keep your responses as brief as possible but make every attempt to keep the caller on the phone without being rude. Don't ask more than 1 question at a time. Don't make assumptions about what values to plug into functions. Ask for clarification if a user request is ambiguous. Speak out all prices to include the currency. Please help them decide between the airpods, airpods pro and airpods max by asking questions like 'Do you prefer headphones that go in your ear or over the ear?'. If they are trying to choose between the airpods and airpods pro try asking them if they need noise canceling. Once you know which model they would like ask them how many they would like to purchase and try to get them to place an order. Add a '•' symbol every 5 to 10 words at natural pauses where your response can be split for text to speech." },
      { "role": "assistant", "content": "Hello! I understand you're looking for a pair of AirPods, is that correct?" },
    ],

About the `system` Attribute

The system attribute is background information for the GPT. As you build your use-case, play around with modifying the context. A good starting point would be to imagine training a new employee on their first day and giving them the basics of how to help a customer.

There are some context prompts that will likely be helpful to include by default. For example:

You have a [cheerful, wise, empathetic, etc.] personality.
Keep your responses as brief as possible but make every attempt to keep the caller on the phone without being rude.
Don't ask more than 1 question at a time.
Don't make assumptions about what values to plug into functions.
Ask for clarification if a user request is ambiguous.
Add a '•' symbol every 5 to 10 words at natural pauses where your response can be split for text to speech.

These context items help shape a GPT so that it will act more naturally in a phone conversation.

The • symbol context in particular is helpful for the app to be able to break sentences into natural chunks. This speeds up text-to-speech processing so that users hear audio faster.
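
To make the chunking idea concrete, here is a minimal sketch of splitting a streamed response on the • marker. It is not the repo's implementation; the BulletChunker class and its onChunk callback are hypothetical names, and a real app would hand each chunk to the text-to-speech service instead of collecting it in an array.

```javascript
// Accumulates streamed text and emits a chunk each time a '•' marker arrives.
// NOTE: BulletChunker is a hypothetical helper, not part of the repo.
class BulletChunker {
  constructor(onChunk) {
    this.buffer = '';
    this.onChunk = onChunk; // called with each completed chunk
  }

  // Feed one streamed piece of text into the chunker.
  push(token) {
    this.buffer += token;
    let idx;
    while ((idx = this.buffer.indexOf('•')) !== -1) {
      const chunk = this.buffer.slice(0, idx).trim();
      this.buffer = this.buffer.slice(idx + 1);
      if (chunk) this.onChunk(chunk);
    }
  }

  // Emit whatever is left once the model finishes its response.
  flush() {
    const rest = this.buffer.trim();
    this.buffer = '';
    if (rest) this.onChunk(rest);
  }
}

const chunks = [];
const chunker = new BulletChunker((chunk) => chunks.push(chunk));
['The AirPods Pro have ', 'noise canceling • would you ', 'like to order a pair?']
  .forEach((token) => chunker.push(token));
chunker.flush();
console.log(chunks);
```

Because the first chunk is emitted as soon as its • arrives, speech synthesis can start before the model has finished generating the rest of the sentence.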

About the `content` Attribute

This attribute is relatively simple: it is your default conversation starter for the GPT. However, you could consider making it more complex and customized based on personalized user data.

In this case, our bot will start off by saying, "Hello! I understand you're looking for a pair of AirPods, is that correct?"
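
If you want to experiment with a personalized opener, one possible approach is to build the message list from caller data. This is only a sketch: buildUserContext and callerName are hypothetical, and where the caller's name would come from is left open.

```javascript
// Builds the initial message list for the GPT.
// NOTE: buildUserContext and callerName are hypothetical, for illustration only.
function buildUserContext(systemPrompt, callerName) {
  const greeting = callerName
    ? `Hello ${callerName}! I understand you're looking for a pair of AirPods, is that correct?`
    : "Hello! I understand you're looking for a pair of AirPods, is that correct?";
  return [
    { role: 'system', content: systemPrompt },
    { role: 'assistant', content: greeting },
  ];
}

console.log(buildUserContext('You are an outbound sales representative.', 'Sam'));
```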

Adding Custom Function Calls

You can have your GPT call external data sources by adding functions to the /functions directory. Follow these steps:

Create a function file (e.g. checkInventory.js in /functions).
Within checkInventory.js, write a function called checkInventory.
Add information about your function to the function-manifest.js file. This information provides context to GPT about what arguments the function takes.

Important: Your function's name must be the same as the name of the file that contains it (excluding the .js extension). For example, our function is called checkInventory, so we have named the file checkInventory.js and set the name attribute in function-manifest.js to checkInventory.

Example function manifest entry:

{
  type: "function",
  function: {
    name: "checkInventory",
    description: "Check the inventory of airpods, airpods pro or airpods max.",
    parameters: {
      type: "object",
      properties: {
        model: {
          type: "string",
          "enum": ["airpods", "airpods pro", "airpods max"],
          description: "The model of airpods, either the airpods, airpods pro or airpods max",
        },
      },
      required: ["model"],
    },
    returns: {
      type: "object",
      properties: {
        stock: {
          type: "integer",
          description: "An integer containing how many of the model are currently in stock."
        }
      }
    }
  },
}
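
A function matching this manifest entry might look something like the sketch below. The stock numbers are made up for illustration, and a real implementation would likely query an inventory system asynchronously. In the repo you would also export the function from checkInventory.js (for example with module.exports, assuming CommonJS) so it can be loaded by name.

```javascript
// Hypothetical stock levels; these numbers are placeholders, not real data.
const STOCK_BY_MODEL = new Map([
  ['airpods', 100],
  ['airpods pro', 10],
  ['airpods max', 0],
]);

// Receives the arguments object described in the manifest and returns
// an object shaped like the manifest's "returns" entry.
function checkInventory(functionArgs) {
  const model = String(functionArgs.model || '').toLowerCase();
  // Unknown models are treated as out of stock in this sketch.
  const stock = STOCK_BY_MODEL.get(model) ?? 0;
  return { stock };
}

console.log(checkInventory({ model: 'airpods pro' }));
```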

Receiving Function Arguments

When ChatGPT calls a function, it will provide an object with multiple attributes as a single argument. The parameters included in the object are based on the definition in your function-manifest.js file.

In the checkInventory example above, model is a required argument, so the data passed to the function will be an object like this:

{
  model: "airpods pro"
}

For our placeOrder function, the arguments passed will look like this:

{
  model: "airpods pro",
  quantity: 10
}
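
The hand-off from the model to your function can be sketched as follows. The OpenAI API delivers function arguments as a JSON string, so the sketch parses that string and passes the resulting object as the single argument. The availableFunctions registry, the callFunction helper, and the placeOrder body are hypothetical and only illustrate the pattern.

```javascript
// Hypothetical registry mapping function names to implementations.
const availableFunctions = new Map([
  // Placeholder placeOrder: echoes the order back with a status field.
  ['placeOrder', ({ model, quantity }) => ({ model, quantity, status: 'received' })],
]);

// Parses the JSON argument string and calls the named function with
// the resulting object as its single argument.
function callFunction(name, argumentsJson) {
  const fn = availableFunctions.get(name);
  if (typeof fn !== 'function') {
    throw new Error(`Unknown function: ${name}`);
  }
  return fn(JSON.parse(argumentsJson));
}

console.log(callFunction('placeOrder', '{"model":"airpods pro","quantity":10}'));
```

Destructuring the argument object in the function signature, as placeOrder does here, keeps each function's expected parameters visible next to the manifest entry that defines them.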

Deploy via Fly.io

Fly.io is a hosting service similar to Heroku that simplifies the deployment process. Because Twilio Media Streams are sent and received from us-east-1, we recommend choosing Fly's Ashburn, VA (IAD) region.

Deploying to Fly.io is not required to try the app, but can be helpful if your home internet speed is variable.

Modify the app name in fly.toml to a unique value (it must be globally unique).

Deploy the app using the Fly.io CLI:

fly launch

fly deploy

Import your secrets from your .env file to your deployed app:

fly secrets import < .env

Lessons Learned

Training the AI is challenging. It would be a major value add if Twilio could offer a connector to OpenAI, similar to Google Dialogflow, to reduce complexity for developers.
Flex Unify is time-consuming to configure, and its current functionality is more easily achieved by calling the Segment APIs directly.
It would be helpful to have an easier way to call Studio from a Flex Plugin.
The ability to surface Segment data via a Studio widget would save developers time.
