Generative AI is producing a bunch of fun new models for us devs to poke at. Did you know you can use these over the phone?
Twilio gives you a superpower called Media Streams. Media Streams provides a WebSocket connection to both sides of a phone call. You can get audio streamed to you, process it, and send audio back.
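To make that loop concrete, here is a minimal sketch of the JSON messages involved. The event names (`start`, `media`, `stop`, `clear`) and the base64 payload follow Twilio's Media Streams message format; the function names and the `state` object are illustrative assumptions, not code from this repo.

```javascript
// Sketch of handling Twilio Media Streams WebSocket messages.
// `state` and the function names are illustrative, not taken from this repo.

// Interpret one raw JSON message received from Twilio.
function handleTwilioMessage(raw, state) {
  const msg = JSON.parse(raw);
  switch (msg.event) {
    case 'start':
      // Twilio sends the stream identifier when the stream starts.
      state.streamSid = msg.start.streamSid;
      return { type: 'started', streamSid: state.streamSid };
    case 'media':
      // Inbound audio arrives base64-encoded (8 kHz mu-law).
      return { type: 'audio', audio: Buffer.from(msg.media.payload, 'base64') };
    case 'stop':
      return { type: 'stopped' };
    default:
      // For example 'connected' or 'mark'.
      return { type: 'ignored', event: msg.event };
  }
}

// Build the message that plays audio back to the caller.
function buildMediaMessage(streamSid, audioBuffer) {
  return JSON.stringify({
    event: 'media',
    streamSid,
    media: { payload: audioBuffer.toString('base64') },
  });
}

// Build the message that drops audio Twilio has buffered but not yet played,
// which is one way to let a caller interrupt the assistant.
function buildClearMessage(streamSid) {
  return JSON.stringify({ event: 'clear', streamSid });
}
```

In a real server you would call `handleTwilioMessage` from the WebSocket's message handler and send the strings built by the other two helpers back over the same socket.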
This repo serves as a demo exploring three models:
- Deepgram for Speech to Text
- ElevenLabs for Text to Speech
- OpenAI for GPT prompt completion
These services combine to create a voice application that is remarkably better at transcribing, understanding, and speaking than traditional IVR systems.
Features:
- Returns responses with low latency, typically about 1 second, by streaming.
- Allows the user to interrupt the GPT assistant and ask a different question.
- Maintains chat history with GPT.
Sign up for Deepgram, ElevenLabs, and OpenAI. You'll need an API key for each service.
Use ngrok to open a tunnel that exposes port 3000:
ngrok http 3000
Copy .env.example to .env and add all API keys.
Set SERVER to your tunneled ngrok URL.
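If you want the app to fail fast when a setting is missing, a small startup check like the sketch below can help. `SERVER` is the variable from the step above; the three API key names are assumptions, so swap in whatever names appear in your .env.example.

```javascript
// Sketch: report required settings that are missing or blank.
// SERVER comes from the setup step above; the API key names are assumptions.
const REQUIRED_VARS = ['SERVER', 'OPENAI_API_KEY', 'DEEPGRAM_API_KEY', 'ELEVENLABS_API_KEY'];

function findMissingEnvVars(env, required = REQUIRED_VARS) {
  return required.filter((name) => !env[name] || env[name].trim() === '');
}

const missing = findMissingEnvVars(process.env);
if (missing.length > 0) {
  console.warn(`Missing environment variables: ${missing.join(', ')}`);
}
```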
Install the necessary packages:
npm install
Start the web server:
npm run dev
Wire up your Twilio number using the console or CLI:
twilio phone-numbers:update +1[your-twilio-number] --voice-url=https://your-server.ngrok.io/incoming
There is a Stream TwiML verb that will connect a stream to your WebSocket server.
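As a rough sketch, the /incoming webhook could answer with TwiML like the string built below. `<Connect>` with a nested `<Stream>` is Twilio's verb for a two-way stream; the `/connection` WebSocket path and the `buildStreamTwiml` helper are assumptions for illustration, and the host is assumed to be a bare host name without a scheme.

```javascript
// Sketch: build the TwiML returned by the /incoming webhook.
// The "/connection" path is an assumption; use the path your server listens on.
function buildStreamTwiml(serverHost, path = '/connection') {
  const url = `wss://${serverHost}${path}`;
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<Response>',
    '  <Connect>',
    `    <Stream url="${url}" />`,
    '  </Connect>',
    '</Response>',
  ].join('\n');
}

console.log(buildStreamTwiml('your-server.ngrok.io'));
```

Send this back with a `text/xml` content type so Twilio treats the response as TwiML.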
CallGPT coordinates the data flow between several services, including Deepgram, OpenAI, ElevenLabs, and Twilio Media Streams.
Within gpt-service.js you'll find the settings for the GPT's initial context and prompt. For example:
this.userContext = [
{ "role": "system", "content": "You are an outbound sales representative selling Apple Airpods. You have a youthful and cheery personality. Keep your responses as brief as possible but make every attempt to keep the caller on the phone without being rude. Don't ask more than 1 question at a time. Don't make assumptions about what values to plug into functions. Ask for clarification if a user request is ambiguous. Speak out all prices to include the currency. Please help them decide between the airpods, airpods pro and airpods max by asking questions like 'Do you prefer headphones that go in your ear or over the ear?'. If they are trying to choose between the airpods and airpods pro try asking them if they need noise canceling. Once you know which model they would like ask them how many they would like to purchase and try to get them to place an order. Add a '•' symbol every 5 to 10 words at natural pauses where your response can be split for text to speech." },
{ "role": "assistant", "content": "Hello! I understand you're looking for a pair of AirPods, is that correct?" },
],
The system attribute is background information for the GPT. As you build your use-case, play around with modifying the context. A good starting point would be to imagine training a new employee on their first day and giving them the basics of how to help a customer.
There are some context prompts that will likely be helpful to include by default. For example:
- You have a [cheerful, wise, empathetic, etc.] personality.
- Keep your responses as brief as possible but make every attempt to keep the caller on the phone without being rude.
- Don't ask more than 1 question at a time.
- Don't make assumptions about what values to plug into functions.
- Ask for clarification if a user request is ambiguous.
- Add a '•' symbol every 5 to 10 words at natural pauses where your response can be split for text to speech.
These context items help shape a GPT so that it will act more naturally in a phone conversation.
The • symbol context in particular is helpful for the app to be able to break sentences into natural chunks. This speeds up text-to-speech processing so that users hear audio faster.
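Here is a minimal sketch of that chunking idea: buffer the streamed text and hand off a chunk to text-to-speech each time a • arrives. The `createChunker` helper and its callback are hypothetical names, not the repo's actual implementation.

```javascript
// Sketch: split streamed GPT text into speakable chunks at the '•' marker.
function createChunker(onChunk) {
  let buffer = '';
  return {
    // Feed each piece of streamed text as it arrives.
    push(text) {
      buffer += text;
      let index;
      while ((index = buffer.indexOf('•')) !== -1) {
        const chunk = buffer.slice(0, index).trim();
        buffer = buffer.slice(index + 1);
        if (chunk) onChunk(chunk);
      }
    },
    // Call once the response is complete to emit any trailing text.
    flush() {
      const rest = buffer.trim();
      buffer = '';
      if (rest) onChunk(rest);
    },
  };
}

const chunks = [];
const chunker = createChunker((chunk) => chunks.push(chunk));
['Sure! The AirPods Pro ', 'are in stock • would you', ' like to order a pair?']
  .forEach((piece) => chunker.push(piece));
chunker.flush();
console.log(chunks);
```

Because the first chunk is emitted as soon as its • arrives, speech synthesis can begin before GPT has finished the full response.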
The assistant attribute is relatively simple: it is your default conversation starter for the GPT. However, you could consider making it more complex and customizing it based on personalized user data.
In this case, our bot will start off by saying, "Hello! I understand you're looking for a pair of AirPods, is that correct?"
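If you do want a personalized opener, one possible approach is sketched below. The `caller` object and its `firstName` field are hypothetical; where that data comes from is up to your application.

```javascript
// Sketch: build the opening assistant message from caller data.
// The `caller` shape (firstName) is hypothetical.
function buildGreeting(caller = {}) {
  const name = caller.firstName ? `, ${caller.firstName}` : '';
  return {
    role: 'assistant',
    content: `Hello${name}! I understand you're looking for a pair of AirPods, is that correct?`,
  };
}

console.log(buildGreeting({ firstName: 'Sam' }).content);
```

With no caller data, the function falls back to the default greeting shown above.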
You can have your GPT call external data sources by adding functions to the /functions directory. Follow these steps:
- Create a function (e.g. checkInventory.js in /functions).
- Within checkInventory.js, write a function called checkInventory.
- Add information about your function to the function-manifest.js file. This information provides context to GPT about what arguments the function takes.
Important: Your function's name must be the same as the name of the file that contains it (excluding the .js extension). For example, our function is called checkInventory, so we have named the file checkInventory.js and set the name attribute in function-manifest.js to checkInventory.
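A quick way to catch a naming mismatch is to compare the manifest names against the functions you have actually loaded. The sketch below assumes a manifest array shaped like the entries in function-manifest.js; `availableFunctions` and the helper name are hypothetical stand-ins for however your app loads the /functions directory.

```javascript
// Sketch: list manifest entries that have no same-named function.
// `availableFunctions` stands in for whatever was loaded from /functions.
function findUnmatchedManifestNames(manifest, availableFunctions) {
  return manifest
    .map((entry) => entry.function.name)
    .filter((name) => typeof availableFunctions[name] !== 'function');
}

const exampleManifest = [
  { type: 'function', function: { name: 'checkInventory' } },
  { type: 'function', function: { name: 'placeOrder' } },
];
// Only checkInventory is defined here, so placeOrder is reported.
const availableFunctions = { checkInventory: () => ({ stock: 0 }) };

console.log(findUnmatchedManifestNames(exampleManifest, availableFunctions));
```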
Example function manifest entry:
{
  type: "function",
  function: {
    name: "checkInventory",
    description: "Check the inventory of airpods, airpods pro or airpods max.",
    parameters: {
      type: "object",
      properties: {
        model: {
          type: "string",
          "enum": ["airpods", "airpods pro", "airpods max"],
          description: "The model of airpods, either the airpods, airpods pro or airpods max",
        },
      },
      required: ["model"],
    },
    returns: {
      type: "object",
      properties: {
        stock: {
          type: "integer",
          description: "An integer containing how many of the model are currently in stock."
        }
      }
    }
  },
}
When GPT calls a function, it passes a single argument: an object containing the function's parameters. Which parameters appear in the object depends on the definition in your function-manifest.js file.
In the checkInventory example above, model is a required argument, so the data passed to the function will be an object like this:
{
model: "airpods pro"
}
For our placeOrder function, the arguments passed will look like this:
{
model: "airpods pro",
quantity: 10
}
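The sketch below shows one way such a call could be routed to your code. The OpenAI API delivers a function call's arguments as a JSON string, so they are parsed before the function runs. The function bodies, the stock numbers, and the `callFunction` helper are placeholders for illustration, not the repo's implementation or real data.

```javascript
// Sketch: route a GPT function call to a local function.
// The implementations and stock numbers are placeholders, not real data.
const functions = {
  checkInventory: ({ model }) => {
    const stock = { airpods: 12, 'airpods pro': 7, 'airpods max': 0 };
    return { stock: stock[model] ?? 0 };
  },
  placeOrder: ({ model, quantity }) => ({ model, quantity, status: 'received' }),
};

// The arguments arrive as a JSON string, so parse them before calling.
function callFunction(name, rawArguments) {
  const fn = functions[name];
  if (!fn) throw new Error(`Unknown function: ${name}`);
  return fn(JSON.parse(rawArguments));
}

console.log(callFunction('checkInventory', '{"model":"airpods pro"}'));
console.log(callFunction('placeOrder', '{"model":"airpods pro","quantity":10}'));
```

Each function receives one object and destructures the fields it needs, which mirrors the single-argument shape described above.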
Fly.io is a hosting service similar to Heroku that simplifies the deployment process. Given Twilio Media Streams are sent and received from us-east-1, it's recommended to choose Fly's Ashburn, VA (IAD) region.
Deploying to Fly.io is not required to try the app, but can be helpful if your home internet speed is variable.
Modify the app name in fly.toml to be a globally unique value.
Deploy the app using the Fly.io CLI:
fly launch
fly deploy
Import your secrets from your .env file to your deployed app:
fly secrets import < .env
- Training the AI is challenging. It would be a major value add if Twilio could offer a connector to OpenAI, similar to Google Dialogflow, to reduce complexity for developers.
- Flex Unify is time-consuming to configure, and current functionality is more easily achieved by calling the Segment APIs directly.
- It would be helpful to have an easier way to call Studio from a Flex Plugin.
- The ability to surface Segment data via a Studio widget would save developers time.