The king is dead, long live the king!
There's a successor to Whisper - even though it's not as well named
SeamlessM4T (demos linked below) is a single model that handles almost every combination of speech, text, and translation, with a 2.3B-parameter large model and a 281M-parameter version for on-device use.
Since it's spoken by only ~30M people, Malayalam is usually my go-to for testing multi-language support. The posted demo handled most of the intonations, sentence structures, and complex wordings I could throw at it 🤯
seamless.metademolab.com/
The HF repos are here - huggingface.co/models?search=facebook/seamless-m4t
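For anyone who wants to poke at it themselves, here's a minimal sketch of speech-to-text translation through the Hugging Face integration. The class and checkpoint names (`SeamlessM4TForSpeechToText`, `facebook/hf-seamless-m4t-medium`) and the three-letter language codes are assumptions based on the HF docs — check the model cards before relying on them:

```python
# Sketch: speech-to-text translation with SeamlessM4T via HF transformers.
# Class/checkpoint names are assumptions -- verify against the model cards.

def translate_speech(audio_path: str, tgt_lang: str = "eng") -> str:
    """Translate a mono audio file into text in tgt_lang."""
    # Heavy imports kept inside the function so the sketch loads cheaply.
    import torchaudio
    from transformers import AutoProcessor, SeamlessM4TForSpeechToText

    checkpoint = "facebook/hf-seamless-m4t-medium"  # assumed checkpoint name
    processor = AutoProcessor.from_pretrained(checkpoint)
    model = SeamlessM4TForSpeechToText.from_pretrained(checkpoint)

    waveform, sample_rate = torchaudio.load(audio_path)
    if sample_rate != 16_000:  # the model expects 16 kHz input
        waveform = torchaudio.functional.resample(waveform, sample_rate, 16_000)

    inputs = processor(audios=waveform, sampling_rate=16_000, return_tensors="pt")
    tokens = model.generate(**inputs, tgt_lang=tgt_lang)
    return processor.decode(tokens[0], skip_special_tokens=True)
```

Something like `translate_speech("clip.wav", tgt_lang="mal")` should get you Malayalam text out; the language codes are three-letter (e.g. "eng", "mal").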
Given that this is unoptimized, I'm excited for the projects that are about to make it faster. Every day we're getting a little closer to a completely on-device Babel fish, a true universal translator.
The results are a lot better than Whisper's, but the large M4T model is also roughly 1.5x the size of whisper-large-v2 (2.3B vs 1.55B parameters). The gains also seem to come from better-labelled and better-aligned speech (and translation) training data, rather than just model size.
The complexity of these data-processing pipelines (which, AFAIK, Meta hasn't open-sourced) might be why we don't see many weights-level improvements to these models from the OSS community.
That makes the non-commercial license on the weights a big negative. It remains to be seen whether this will change.
What should be applauded is the increasing detail in Meta's papers outlining the inner workings. I'll be testing out the model in the coming weeks and sharing what I learn from the research!