Transcriber

A prototype of an application for experimenting with Cocoa APIs used for transcribing audio. The app explores the following Apple frameworks:

Speech
AVFoundation

With the following classes:

SFSpeechRecognizer
AVAudioRecorder
AVAudioPlayer

Don't use this app as an example of good software engineering. There is no proper error handling and methods are not neatly factored into separate files for better overview.

Rational Behind Design

The GUI is mainly designed for experimentation and hacking. That is why it is fairly complex. It is to be able to see a lot of info and be able to do various operations in small steps to see what happens.

Howto Use

This software supports a variety of actions. You can record audio to file and later transcribe. Or you could use an audio file previously recorded with other software and transcribe it.

Either click record or select an audio file with the ... button. When recording you need to click stop to end. You will then get the path to an audio file you just recorded.

Whatever way you got a path to an audio file, you then click transcribe to turn audio into text. Note, this can take some time and program does not currently give any visual feedback on progress.

Intro to Code

At this point this is pretty much a big grayball of mud design. Almost everything is done inside the WinController.swift file. The entrypoint for a Cocoa app is AppDelegate. This is where we initilize the WindController class which manages the main UI. A Window in Cocoa is managed by a subclass of a NSWindowController which handles loading a .xib file describing the UI.

The most important function is startTranscribe which is where we create a SFSpeechURLRecognitionRequest. This follows a fairly standard Cocoa pattern where a request object is created and then it is later passed to a method recognitionTask which carries our the reckognition task. A task if typically a long running task, which you want a callback for when it is done. This callback will get call repeatedly as transcribed text becomes available and you have to check yourself if it is done or not with result.isFinal.

Word, SpokenWord, SpokenDoc

The SFSpeechURLRecognitionRequest will return a SFTranscription object representing our transcription. However this isn't very practical to work with as you cannot mutate objects. Thus we have made the SpokenDoc type to represent transcriptions, where every individual segment in the transcription with a timestamp and duration is represented by SpokenWord.

Table With Transcribed Words - TranscriptionDataSource

SpokenDoc is shown both in a text view transcribedTextView and in a table wordTableView. The table UI is provided data to show through a data source transcriptionDataSource. This data source is the TranscriptionDataSource which wraps our SpokenDoc. In a Model-View-Controller design we can think of NSTableView (transcribedTextView) as the View. The controller is TranscriptionDataSource and SpokenDoc is our Model. The view asks the controller for rows and columns and we implement methods in the controller to pick the requested row and column data out of the SpokenDoc.

ordovician / Transcriber