Subtitle generation is now a crucial productivity tool at work, used in meetings, lectures, webinars and videos. Subtitles are generally produced by automatic audio transcription services based on Artificial Intelligence, but currently very few platforms can process audio in real time, capture it directly from the output speakers, and remain compatible with all meeting software and video applications.
How closed captions are generated through Artificial Intelligence
In the past, closed captions were generated by human transcriptionists, who would listen to the audio and type out the text in a word-processing document. Nowadays, machine learning algorithms can generate high-quality captions at a much faster rate. The algorithms analyze the audio and can generate the text in real time. The process is not perfect, but it is still much faster than having a human transcribe every word spoken in the audio.
Specifically, transcription software analyzes media that contains speech and, using machine learning algorithms based on neural networks, converts the audio data into text.
Transcription can generally be achieved through two different approaches:
- Batch transcription: a media file placed in a specific location is processed as a whole; the results are returned once processing finishes, which usually takes longer for larger media files.
- Streaming transcription: the audio played by a device is processed as a stream, and the transcription is generated in real time.
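The difference between the two approaches can be sketched in a few lines of Python. This is only an illustration, not a real speech engine: `fake_recognize` is a hypothetical stand-in that labels each chunk of audio instead of recognizing speech, and the names `transcribe_batch` and `transcribe_stream` are assumptions made for this example rather than part of any vendor's API.

```python
from typing import Iterable, Iterator, List


def fake_recognize(chunk: bytes) -> str:
    """Hypothetical stand-in for a speech model: returns a placeholder label."""
    return f"<{len(chunk)} bytes of speech>"


def transcribe_batch(chunks: List[bytes]) -> str:
    """Batch: the whole recording is available up front; one result at the end."""
    return " ".join(fake_recognize(chunk) for chunk in chunks)


def transcribe_stream(chunks: Iterable[bytes]) -> Iterator[str]:
    """Streaming: emit a piece of text as soon as each chunk arrives."""
    for chunk in chunks:
        yield fake_recognize(chunk)


# A pretend recording, already split into three chunks of audio bytes.
recording = [b"\x00" * 4, b"\x00" * 2, b"\x00" * 6]

# Batch: nothing is returned until every chunk has been processed.
full_text = transcribe_batch(recording)

# Streaming: captions become available one chunk at a time.
captions = list(transcribe_stream(iter(recording)))
```

In the batch case the caller waits for a single result, while in the streaming case each caption can be displayed as soon as it is produced, which is what makes live captioning possible.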
Many technology vendors, including public cloud providers such as AWS, Google, and Microsoft, offer their transcription software via APIs.
Examples are Amazon Transcribe, Google Cloud Speech-to-Text, and Microsoft Azure's Speech service.
What is the difference between real-time captions and batch transcriptions?
Real-time transcription is the process of converting a stream of speech into text as it is spoken. The software monitors the audio signal and transcribes it while the speaker is talking.
Batch transcription is done after the event has taken place: the audio is recorded and transcribed at a later time. The batch software processes a media (audio) file and provides the transcript once processing is finished, so the execution time depends on the length of the media file.
Real-time captions are needed more often than batch transcription, because they serve live events such as business meetings, lectures, or seminars.
However, very few platforms currently offer real-time transcription of audio streams. Most transcription services on the market provide only batch transcription, that is, transcripts of an audio file that the user uploads in advance. They offer real-time transcripts only for dictation (that is, transcribing your own voice through the microphone) or within their own platform environment.
Let’s see some examples.
Batch transcription is easy: the user uploads an audio or video file, submits the request and, after some time, receives the transcript. As noted above, the processing time depends on the size of the file.
Trint, Sonix, Rev, oTranscribe, Temi, and others offer this type of service to end consumers. The voice message transcription in WhatsApp is also a sort of batch transcription: you record a very short voice message, and the result is produced in about a second.
Real-time transcription is more challenging: it requires capturing streaming audio and processing it as it arrives. In this case, you get the transcript or live captions almost simultaneously with the audio or video. This is especially useful during meetings and calls, lectures and webinars, interviews, videos, and similar cases.
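To make "processing audio as it arrives" more concrete, here is a minimal sketch of how a continuous audio stream can be cut into short frames before being handed to a recognizer. The audio format (16 kHz, 16-bit mono PCM) and the 100 ms frame length are assumptions chosen for illustration, not requirements of any particular service, and the helper names are hypothetical.

```python
from typing import Iterator

SAMPLE_RATE_HZ = 16_000  # assumed sample rate
BYTES_PER_SAMPLE = 2     # assumed 16-bit mono PCM
FRAME_MS = 100           # assumed frame length in milliseconds


def frame_size_bytes(
    sample_rate_hz: int = SAMPLE_RATE_HZ,
    bytes_per_sample: int = BYTES_PER_SAMPLE,
    frame_ms: int = FRAME_MS,
) -> int:
    """Number of bytes that hold one frame of audio."""
    return sample_rate_hz * bytes_per_sample * frame_ms // 1000


def split_into_frames(pcm: bytes, frame_bytes: int) -> Iterator[bytes]:
    """Yield consecutive frames; the last one may be shorter."""
    for start in range(0, len(pcm), frame_bytes):
        yield pcm[start:start + frame_bytes]


# One second of silence in the assumed format.
one_second = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
frames = list(split_into_frames(one_second, frame_size_bytes()))
```

With these assumed values, one second of audio is split into ten frames of 3,200 bytes each. Shorter frames reduce the delay before text can appear, at the cost of more frequent processing.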
Real-time meeting captions with a single transcription software
As mentioned, most software offers real-time captions only to users of its own platform. Real-time transcription is most widely adopted in collaboration and meeting tools such as Zoom, Microsoft Teams, Skype, Webex, Google Meet, and Otter.ai.
But they all have some limitations:
- Not all of them provide the feature of live captioning during calls or meetings.
- When available (e.g. in MS Teams), the live captions are restricted to the users of the same organization and they all need to have the software installed.
- The software doesn’t retain or store the transcripts for later review. They disappear with the next sentence.
- Real-time transcription solutions capture audio only from the microphone. That is, only your own speech, not that of the other participants.
The ideal solution for real-time transcription of audio streams is a single piece of software that is compatible with any other application.
The key feature is the ability to process the audio played by the output speakers, not only the voice picked up by the microphone.
In this way, the transcription software can process the audio regardless of the application used for the meeting or video. One Transcriber provides exactly this feature, so it works with any audio played through the output speakers.