We don’t develop our own ASR technology (we leave that to the R&D teams at companies like Speechmatics, Amazon and Google), but we do regularly track all the major ASR engines and how they perform for captioning use cases. When delivering our service, we use the best ASR engine currently available as a foundation, then apply our own technology, expertise and experience to maximise accuracy. First, our speech-to-text experts in the captioning teams regularly train the engines with new terms and vocabulary.
There’s an art to this; it’s not as simple as uploading a list of words, because spelling often doesn’t follow logical pronunciation rules. For example, my surname Gauthier is pronounced “go-tee-ay,” but I would guess an English-language trained ASR engine would follow English pronunciation rules and expect it, wrongly, to be pronounced “gaw-th-eer.”
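To make the idea concrete, here is a minimal sketch of the kind of custom-vocabulary entry many ASR engines accept: the written form is paired with one or more “sounds like” spellings so the engine can match audio that breaks English pronunciation rules. The field names and lookup helper below are illustrative assumptions, not any particular vendor’s API.

```python
# Illustrative custom-vocabulary entries. "sounds_like" holds phonetic
# spellings so the engine can recognise words whose audio doesn't match
# their written form. Field names here are hypothetical, not a real API.
additional_vocab = [
    {"content": "Gauthier", "sounds_like": ["go tee ay"]},
    {"content": "COVID-19", "sounds_like": ["covid nineteen"]},
]


def lookup(spoken):
    """Return the written form whose sounds-like hints match the spoken guess."""
    for entry in additional_vocab:
        if spoken.lower() in (s.lower() for s in entry["sounds_like"]):
            return entry["content"]
    return None
```

In practice the captioning teams curate and test these entries per programme, since a hint that helps one speaker’s accent can mislead the engine on another’s.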
Our captioning teams know all of this and optimise vocabulary training to give each term the best chance of an accurate transcription. Second, we apply tens of thousands of bespoke house styles, built by our teams, to improve readability.
ASR engines tend to spell everything out as words: “COVID nineteen,” “ten thousand five hundred pounds,” “twelve forty-five pm.” Using house styles created by our teams, you get “COVID-19,” “£10,500” and “12:45 p.m.” instead.
A much easier reading experience.
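A house-style pass like the one described above can be sketched as an ordered set of rewrite rules applied to the raw ASR output. This is a minimal illustration only: real style sets run to tens of thousands of rules, and the three patterns below are just the examples from this article.

```python
import re

# A minimal sketch of a house-style pass: ordered regex rules that rewrite
# spelled-out ASR output into the house style. These three rules are
# illustrative only; production style sets are far larger.
HOUSE_STYLES = [
    (re.compile(r"\bCOVID nineteen\b", re.IGNORECASE), "COVID-19"),
    (re.compile(r"\bten thousand five hundred pounds\b", re.IGNORECASE), "£10,500"),
    (re.compile(r"\btwelve forty-five pm\b", re.IGNORECASE), "12:45 p.m."),
]


def apply_house_styles(text):
    """Apply each house-style rule in order to the raw transcript text."""
    for pattern, replacement in HOUSE_STYLES:
        text = pattern.sub(replacement, text)
    return text
```

Rule order matters in a real system (longer, more specific patterns should fire before shorter ones), which is one reason these style sets are built and maintained by the captioning teams rather than generated automatically.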