Judging by how many questions I'm getting about LLM transcription, it seems that multimodal LLMs (like Gemini) being cheaper and better at straight-up transcription has been a secret (and not just to me).
Collecting useful ideas and questions here so it's all in one place 👉
x.com/hrishioa/status/1846222504018563210
Segmenting to under 128k tokens makes things a lot cheaper (at least for Gemini). Additionally, splitting on silence (use ffmpeg or Audacity) into <10-minute chunks significantly improves quality - my uneducated guess is it has to do with how Gemini does attention.
x.com/nikshepsvn/status/1846240525994569756
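A rough sketch of the silence-splitting idea above, in Python. It assumes you have already run ffmpeg's `silencedetect` filter (for example `ffmpeg -i input.mp3 -af silencedetect=noise=-30dB:d=0.5 -f null -`) and captured its log output; the noise threshold, the minimum silence length, and the helper names `parse_silence_midpoints` and `plan_cuts` are illustrative assumptions, not anything from the thread. The code only picks cut points that keep each chunk under 10 minutes.

```python
import re

MAX_CHUNK_SECONDS = 600.0  # the "<10-minute" target; tune to taste


def parse_silence_midpoints(ffmpeg_log: str) -> list[float]:
    """Return the midpoint (seconds) of each silence reported by silencedetect."""
    starts = [float(x) for x in re.findall(r"silence_start:\s*(-?[\d.]+)", ffmpeg_log)]
    ends = [float(x) for x in re.findall(r"silence_end:\s*(-?[\d.]+)", ffmpeg_log)]
    return [(s + e) / 2 for s, e in zip(starts, ends)]


def plan_cuts(silences: list[float], duration: float,
              max_len: float = MAX_CHUNK_SECONDS) -> list[float]:
    """Greedily cut at the last silence that still keeps the chunk <= max_len.

    Falls back to a hard cut at max_len when no silence lands in the window.
    """
    cuts: list[float] = []
    start = 0.0
    candidates = sorted(silences)
    while duration - start > max_len:
        in_window = [t for t in candidates if start < t <= start + max_len]
        cut = in_window[-1] if in_window else start + max_len
        cuts.append(cut)
        start = cut
    return cuts
```

The resulting cut points can then be handed to ffmpeg's segment muxer, e.g. `ffmpeg -i input.mp3 -f segment -segment_times 580,1100,1500 -c copy out_%03d.mp3`. Cutting at the midpoint of a silence is just one reasonable choice; it leaves a little padding on both sides of the cut.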
100%! With a pre-processing step (also using Gemini), you can partially automate this by pulling out context, title, topics of discussion, commonly misspelled names, etc.
x.com/rez0__/status/1846243414087483460
The instruct capability is the truly underexplored, powerful thing. Even something as simple as 'fix reasonable mistakes in transcription' reduces error rates by as much as 10% in my testing.
On top of that, you can add proper nouns for your weirdly spelled startup, and so much more.
x.com/elliotdohm/status/1846253203417190522
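To make the instruction idea above concrete, here is a minimal sketch of a prompt builder that folds in extracted context, a "fix reasonable mistakes" instruction, and a list of proper nouns. The function name `build_transcription_prompt` and the exact wording of each line are assumptions for illustration; the resulting string would be sent alongside the audio using whatever client you already use.

```python
def build_transcription_prompt(proper_nouns: list[str],
                               context: str = "",
                               fix_mistakes: bool = True) -> str:
    """Assemble a plain-text instruction to send along with the audio."""
    lines = ["Transcribe the attached audio."]
    if fix_mistakes:
        # The simple instruction mentioned in the thread.
        lines.append("Fix reasonable mistakes in transcription.")
    if context:
        # E.g. title and topics pulled out by a pre-processing pass.
        lines.append(f"Context: {context}")
    if proper_nouns:
        lines.append("Spell these names exactly as written:")
        lines.extend(f"- {name}" for name in proper_nouns)
    return "\n".join(lines)
```

Keeping the prompt as a plain string makes it easy to diff between runs when you are comparing error rates.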
Unfortunately, open-source multimodal models aren't there just yet. You could possibly boost the quality by using multiple sources like Whisper, diart, etc. and then pulling them together with a good model like Gemma (which lets you carry over some prompting) - I haven't tried it yet.
x.com/rishav_sapahia/status/1846245656828010893
Overlapping speakers are tough - especially for me because I tend to ADD-interrupt a lot 😅
Whisper will usually pick up one part of the full sentence, so maybe there's something there to merge with the diarized output?
x.com/zhanghaoxxxx/status/1846098420567810075
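One untested way to try the merge idea above: label each transcribed segment with every diarized speaker who overlaps it for a meaningful share of its length, so overlapping speech shows up as a segment with two speakers. The `Segment` and `Turn` types, the `label_segments` helper, and the 20% threshold are all assumptions for illustration; real Whisper and diarization outputs would need to be converted into these shapes first.

```python
from dataclasses import dataclass


@dataclass
class Segment:  # one timestamped piece of transcript
    start: float
    end: float
    text: str


@dataclass
class Turn:  # one diarized speaker turn
    start: float
    end: float
    speaker: str


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_segments(segments: list[Segment], turns: list[Turn],
                   min_share: float = 0.2) -> list[tuple[list[str], str]]:
    """Pair each segment's text with its speakers, most-overlapping first."""
    labelled = []
    for seg in segments:
        length = max(seg.end - seg.start, 1e-9)  # avoid dividing by zero
        totals: dict[str, float] = {}
        for turn in turns:
            shared = overlap(seg.start, seg.end, turn.start, turn.end)
            if shared > 0:
                totals[turn.speaker] = totals.get(turn.speaker, 0.0) + shared
        ranked = sorted(totals.items(), key=lambda kv: -kv[1])
        speakers = [s for s, t in ranked if t / length >= min_share]
        labelled.append((speakers, seg.text))
    return labelled
```

Segments that come back with more than one speaker are the ones worth sending for a second, closer pass.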
This is awesome! Yeah, doing sequential passes while keeping previous context is a big boost to quality, at the tradeoff of cost, which keeps coming down.
x.com/meekaale/status/1846259389130256838
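The sequential-pass idea above can be sketched as a simple loop that carries the tail of the previous transcript into the next call. The `transcribe` callable is a stand-in for whatever model call you use (hypothetical signature: chunk path plus previous context in, text out), and the 2000-character context window is an arbitrary assumption.

```python
from typing import Callable, Iterable


def transcribe_sequentially(chunks: Iterable[str],
                            transcribe: Callable[[str, str], str],
                            context_chars: int = 2000) -> list[str]:
    """Transcribe chunks in order, passing the previous tail as context."""
    transcripts: list[str] = []
    previous_tail = ""
    for chunk in chunks:
        text = transcribe(chunk, previous_tail)
        transcripts.append(text)
        # Keep only the end of the last transcript to bound prompt size (and cost).
        previous_tail = text[-context_chars:] if context_chars > 0 else ""
    return transcripts
```

Because each call depends on the previous result, the chunks cannot be processed in parallel; that is part of the cost tradeoff.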
Without other references, I think it finds it hard to tell. I did once try giving it a 10-second clip of clean audio and it reasonably pointed out problems with my audio.
Wasn't sure then what I would use it for!
x.com/randallb/status/1846260048856592484