livekit / python-sdks

LiveKit real-time and server SDKs for Python

Home Page: https://docs.livekit.io


Audio frames streaming and OPUS packets

JustNello opened this issue · comments

Hello,
thanks for this project :)

I'd like to transcribe an audio track with Deepgram, but I have some issues.

The application
The client is built with LiveKit React Components (i.e. LiveKitRoom and AudioConference), with redundant (RED) encoding disabled when a client joins a room, as described in the docs.

The server uses this Python SDK and was implemented starting from the Whisper example. In my case, the "whisper_task" has been replaced by a "deepgram_task", shown in this gist.

Issue
I don't think I understand how AudioFrame (from the rtc package) encodes its data. I'm new to audio streaming in general, which may be the cause of the issue. I know that the audio format is Opus, but:

  • Is it a containerized audio stream or a raw audio stream? I'm referring to this
  • What is the frame duration? Deepgram requires streaming buffer sizes between 20 and 250 milliseconds of audio

In other words, what does bytes(frame.data) return? Is it an Opus packet?
I'm not able to inspect the packet using a packet inspector.

Thank you in advance for any help you may give,
Luca

Hey Luca!
The frames you receive from the AudioStream are raw signed PCM.
Looking at Deepgram's docs, they do support linear16.
I've used Deepgram before; I think you can just connect to their websocket and send the frames you receive from LiveKit directly. (Also, don't forget to use the right sample rate.)
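For reference, a minimal sketch of the buffering side of this. The `PcmBatcher` helper and its names are hypothetical (not part of the SDK); the only facts assumed from the thread are that `bytes(frame.data)` yields raw signed 16-bit PCM and that Deepgram wants chunks between 20 and 250 ms of audio:

```python
BYTES_PER_SAMPLE = 2  # linear16 = signed 16-bit PCM


def ms_to_bytes(ms: int, sample_rate: int, num_channels: int) -> int:
    """Number of raw PCM bytes covering `ms` milliseconds of audio."""
    return sample_rate * num_channels * BYTES_PER_SAMPLE * ms // 1000


class PcmBatcher:
    """Accumulates raw PCM and emits chunks of at least `min_ms` of audio.

    WebRTC audio frames are typically ~10 ms each, so emitted chunks stay
    just above `min_ms` and well under Deepgram's 250 ms upper bound.
    """

    def __init__(self, sample_rate: int, num_channels: int, min_ms: int = 20):
        self._buf = bytearray()
        self._min_bytes = ms_to_bytes(min_ms, sample_rate, num_channels)

    def push(self, pcm: bytes):
        """Add one frame's PCM; return a chunk once >= min_ms is buffered."""
        self._buf += pcm
        if len(self._buf) >= self._min_bytes:
            chunk = bytes(self._buf)
            self._buf = bytearray()
            return chunk
        return None
```

In the receive loop this would look roughly like feeding each frame with `batcher.push(bytes(frame.data))` and, whenever a chunk comes back, sending it over the Deepgram websocket (the exact AudioStream iteration API depends on your SDK version).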

Awesome, it works 😀

One last question to improve my understanding: rtc.RemoteTrackPublication.mime_type yields audio/opus. When is the audio converted to signed PCM?

The mime_type represents the codec used while the media is transmitted to the recipient. Upon receipt, libwebrtc immediately decodes it.
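To make the distinction concrete, a quick byte-rate calculation (the 48 kHz mono figures are illustrative, and the Opus bitrate is a typical value, not something from this thread) shows why the decoded frames are much larger than the Opus packets on the wire:

```python
# Decoded PCM data rate for an audio track, assuming 48 kHz mono
# signed 16-bit samples (illustrative values).
sample_rate = 48_000     # samples per second
num_channels = 1
bytes_per_sample = 2     # linear16

pcm_bytes_per_second = sample_rate * num_channels * bytes_per_sample
print(pcm_bytes_per_second)  # 96000 bytes/s of raw PCM

# A typical Opus voice stream runs around 32 kbit/s = 4000 bytes/s,
# so libwebrtc's decode step expands the data roughly 24x here.
opus_bytes_per_second = 32_000 // 8
print(pcm_bytes_per_second // opus_bytes_per_second)  # 24
```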