AllYourBot / hostedgpt

An open version of ChatGPT you can host anywhere or run locally.

Voice polish

krschacht opened this issue · comments

  • SMALL BUT EASY: The stop icon flashes briefly after submitting the first message (this is because it's doing a Turbo Drive navigate rather than a Turbo Morph, see if I can detect this). NO EASY WAY TO FIX
  • SMALL BUT EASY: When the user is enabling the mic and is asked about enabling the screen sharing, the mic actually turns on before they've answered the screen sharing question. Figure out where I missed an await.
  • IMPORTANT: Add sound effects so the user can tell when the AI is waiting, thinking, etc.
  • SMALL: It keeps playing the typing sound when you've suspended it. Also, the typing sound needs to be randomized a bit more to sound realistic. It's too repetitive.
  • IMPORTANT: Wrap the system prompt with voice-specific instructions
  • IMPORTANT: Route all the requests through a proxy
  • Decide how I should handle when markdown code blocks appear in the response
  • IMPORTANT: Decrease latency so the user gets a faster reply. My best idea is this. Currently it waits for 1.8s of silence before submitting the message. Instead, only wait for maybe 0.2 - 0.5s of silence, then go ahead and submit the message to the API. If we detect that the user starts talking again before we've received and started playing back the response, edit the message we submitted to include the additional words and re-submit it (this takes advantage of conversation branching, which is already in place). Note: we need to confirm the previous reply job will be cancelled when this revised message is submitted.
  • IMPORTANT: Add full test coverage and get CI to run the javascript tests
  • FUTURE: When there is a lot of background noise (e.g. music playing) then the system never detects any silence. Maybe we can establish a baseline noise level and re-define "silence" as a return to baseline noise level and/or we can submit responses after X seconds even if we don't detect any silence.
  • FUTURE: Right now a wake word (e.g. the assistant's name) is not needed. By default, the assistant is always listening unless you've dismissed it by saying "Hold on Samantha". However, it would be more lifelike for the assistant to dismiss itself after X minutes of you not talking to it. If someone is sitting in the room with us but we've been ignoring them for ~30 minutes, then we naturally re-address them when we start talking again.
  • FUTURE: Is there a heuristic for knowing when it’s overhearing a conversation between two people vs when you’re talking to it?
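
The latency idea in the list above (submit after a short silence, then revise if the user keeps talking) could be sketched roughly as follows. This is only an illustration, not hostedgpt's code: the class name, the `submit` and `revise` callbacks, and the 300 ms default are all assumptions. Timestamps are passed in explicitly so the logic can be tested without real timers.

```javascript
// Hypothetical sketch: names and thresholds are illustrative, not from hostedgpt.
class SpeculativeSubmitter {
  constructor({ submit, revise, silenceMs = 300 }) {
    this.submit = submit;       // (text) => id of the submitted message
    this.revise = revise;       // (id, text) => edit the message and request a new reply
    this.silenceMs = silenceMs; // short silence that triggers an early submit
    this.text = "";             // everything heard so far in this turn
    this.submittedText = "";    // what the API has already been given
    this.lastSpeechAt = null;   // time of the most recent words
    this.pendingId = null;      // submitted message whose reply is not playing yet
  }

  // Call whenever speech recognition produces more words.
  onSpeech(words, now) {
    this.text = this.text ? `${this.text} ${words}` : words;
    this.lastSpeechAt = now;
  }

  // Call periodically with the current time in milliseconds.
  tick(now) {
    if (this.lastSpeechAt === null) return;
    if (now - this.lastSpeechAt < this.silenceMs) return; // user may still be talking
    if (this.text === this.submittedText) return;         // nothing new to send
    if (this.pendingId === null) {
      this.pendingId = this.submit(this.text);            // first, speculative submit
    } else {
      this.revise(this.pendingId, this.text);             // user kept talking: revise
    }
    this.submittedText = this.text;
  }

  // Call once the reply starts playing back; the turn is over.
  onPlaybackStarted() {
    this.text = "";
    this.submittedText = "";
    this.lastSpeechAt = null;
    this.pendingId = null;
  }
}
```

In this sketch `revise` is where the open question from the list would live: it assumes that editing the message also cancels the earlier reply job, which still needs to be confirmed.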

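For the background-noise item in the list above, here is a rough sketch of what "silence as a return to baseline" plus a maximum wait could look like. Everything in it is an assumption made for illustration: the class name, the `margin` and `smoothing` parameters, and the 15 second cap are invented, and only the 1.8s silence default comes from the list. `level` is assumed to be some volume measure such as RMS in the 0 to 1 range.

```javascript
// Hypothetical sketch: names and defaults are illustrative, not from hostedgpt.
class AdaptiveSilenceDetector {
  constructor({ margin = 0.05, silenceMs = 1800, maxTurnMs = 15000, smoothing = 0.05 } = {}) {
    this.margin = margin;       // how far above baseline counts as speech
    this.silenceMs = silenceMs; // quiet time needed before submitting
    this.maxTurnMs = maxTurnMs; // submit anyway after this long
    this.smoothing = smoothing; // how quickly the baseline follows quiet frames
    this.baseline = null;       // running estimate of the ambient noise level
    this.quietSince = null;     // when the level last dropped back to baseline
    this.turnStartedAt = null;  // when the current spoken turn began
  }

  // Feed one volume sample; returns true when the turn should be submitted.
  update(level, now) {
    if (this.baseline === null) this.baseline = level; // assumes the first frame is ambient
    const loud = level > this.baseline + this.margin;
    if (loud) {
      if (this.turnStartedAt === null) this.turnStartedAt = now;
      this.quietSince = null;
    } else {
      // Adapt only on quiet frames so speech does not raise the baseline.
      this.baseline += this.smoothing * (level - this.baseline);
      if (this.quietSince === null) this.quietSince = now;
    }
    if (this.turnStartedAt === null) return false; // nobody has spoken yet

    const quietLongEnough = this.quietSince !== null && now - this.quietSince >= this.silenceMs;
    const turnTooLong = now - this.turnStartedAt >= this.maxTurnMs;
    if (quietLongEnough || turnTooLong) {
      this.turnStartedAt = null;
      this.quietSince = null;
      return true;
    }
    return false;
  }
}
```

One known gap in this sketch: if music starts after the baseline was learned in a quiet room, every frame looks loud and the baseline never adapts, so only the `maxTurnMs` fallback ends the turn.
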
Maybe you could add a model like Silero VAD?

@lumpidu That one is new to me. Thanks for the tip. Btw, if you want to try it out my branch is working pretty well. I’m just going to add automated tests and do a little more polish before merging in.

@krschacht You could integrate the model either into the backend via https://github.com/ankane/onnxruntime-ruby, or even into the frontend: https://onnxruntime.ai/docs/api/js/index.html. There is a demo for the browser: https://github.com/ricky0123/vad. I will definitely try out your project!

@lumpidu This is really cool. I was not aware of client-side models like this for voice detection that could be run in this way. I wonder if it's using the new WebAssembly under the hood.

I don't think I'd prioritize this in the near-term. In case you haven't seen, yesterday I merged in a v1 of the voice mode: #348

I just updated my "voice polish" to-do list at the top of this task based on where I left off yesterday. But one notable thing is that OpenAI just announced that they have this incredible new voice model which is going to be released "soon". I'm not sure if soon is a couple weeks or a couple months :) but I will probably, intentionally, defer some of these tasks until after I can evaluate that. However, I'm using this voice mode daily now myself so I'm going to keep polishing it so that I can enjoy using it while I wait.

If you're interested in helping with any of this, let me know! I can suggest good tasks, and I can help you ramp up on the implementation. I welcome help! :)