Gemini Transcription Tips and FAQs


Based on how many questions I'm getting about LLM transcription, it seems that multimodal LLMs (like Gemini) being cheaper and better at straight-up transcription has been a secret (not just for me). Collecting useful ideas and questions here so it's all in one place 👉 x.com/hrishioa/status/1846222504018563210

Segmenting to under 128k tokens makes things a lot cheaper (at least for Gemini). Additionally, splitting on silence (use ffmpeg or Audacity) into <10-minute chunks significantly improves quality. My uneducated guess is that it has to do with how Gemini does attention. x.com/nikshepsvn/status/1846240525994569756
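
A rough sketch of the splitting step, assuming you have already run ffmpeg's silencedetect filter and captured its log output. The function names, the 600-second limit and the thresholds in the example command are my own placeholders, not tuned values:

```python
import re

# Example command that produces the log (thresholds are assumptions, adjust to taste):
#   ffmpeg -i input.mp3 -af silencedetect=noise=-30dB:d=0.5 -f null - 2> silence.log
SILENCE_RE = re.compile(r"silence_(start|end): (-?\d+(?:\.\d+)?)")


def silence_midpoints(log: str) -> list[float]:
    """Return the midpoint (in seconds) of every silence found in a silencedetect log."""
    midpoints, start = [], None
    for kind, value in SILENCE_RE.findall(log):
        if kind == "start":
            start = float(value)
        elif start is not None:
            midpoints.append((start + float(value)) / 2)
            start = None
    return midpoints


def pick_cuts(midpoints: list[float], total: float, max_len: float = 600.0) -> list[float]:
    """Greedily pick cut points so every chunk is at most max_len seconds long.

    Each cut is the latest silence that still fits in the current chunk.
    If no silence fits, fall back to a hard cut at max_len.
    """
    cuts, chunk_start = [], 0.0
    while total - chunk_start > max_len:
        candidates = [m for m in midpoints if chunk_start < m <= chunk_start + max_len]
        cut = max(candidates) if candidates else chunk_start + max_len
        cuts.append(cut)
        chunk_start = cut
    return cuts
```

The cut points can then be passed to whatever tool does the actual splitting. Cutting in the middle of a silence keeps words from being chopped in half at chunk boundaries.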

100%! With a pre-processing step (also using Gemini), you can partially automate this by pulling out context, title, topics of discussion, commonly misspelled names, etc. x.com/rez0__/status/1846243414087483460

The instruct capability is the truly underexplored, powerful thing. Even something as simple as 'fix reasonable mistakes in transcription' reduces error rates by as much as 10% in my testing. On top, you can add in proper nouns for your weirdly spelled startup, and so much more. x.com/elliotdohm/status/1846253203417190522
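
To make that concrete, here is a minimal sketch of how such an instruction block might be assembled. Only the 'fix reasonable mistakes in transcription' line comes from my testing; the function name, parameters and the rest of the wording are hypothetical, and the result is just a string you would send alongside the audio with whichever client you use:

```python
def build_transcription_prompt(proper_nouns=(), context="", fix_mistakes=True):
    """Assemble a plain-text instruction prompt for an LLM transcription request."""
    lines = ["Transcribe the attached audio."]
    if fix_mistakes:
        # The one instruction taken directly from the thread.
        lines.append("Fix reasonable mistakes in transcription.")
    if context:
        # Optional context, e.g. title and topics pulled out in a pre-processing pass.
        lines.append(f"Context: {context}")
    if proper_nouns:
        # Names the model is likely to misspell if left to guess.
        lines.append("Spell these names exactly as written: " + ", ".join(proper_nouns))
    return "\n".join(lines)


prompt = build_transcription_prompt(
    proper_nouns=["SouthBridge", "Hrishi"],
    context="Podcast episode about LLM transcription",
)
```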

What I'm trying - with reasonable success - is matching whisper-tiny timestamps (which are super cheap to make and pretty accurate, even in the browser) to the transcription. You can use a similar diff algorithm (like the one in the repo: github.com/SouthBridgeAI/llm-transcription-study) to match text. x.com/deepwhitman/status/1846241947897520561
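
A minimal sketch of the matching idea, using Python's standard difflib rather than the diff code in the repo (which may work differently). It assumes you already have word-level timestamps from Whisper as (word, start_seconds) pairs; the function names are made up for illustration:

```python
from difflib import SequenceMatcher


def normalize(word: str) -> str:
    """Lowercase and strip punctuation so 'Hello,' and 'hello' compare equal."""
    return "".join(ch for ch in word.lower() if ch.isalnum())


def transfer_timestamps(whisper_words, llm_text):
    """Copy Whisper start times onto the words of an LLM transcript.

    whisper_words: list of (word, start_seconds) pairs.
    Returns a list of (llm_word, start_seconds or None), one entry per LLM word.
    Words that the diff cannot match keep None as their timestamp.
    """
    llm_words = llm_text.split()
    a = [normalize(w) for w, _ in whisper_words]
    b = [normalize(w) for w in llm_words]
    result = [(w, None) for w in llm_words]
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            result[block.b + k] = (llm_words[block.b + k], whisper_words[block.a + k][1])
    return result
```

Unmatched words (usually the ones the LLM corrected) can be given timestamps by interpolating between their matched neighbours.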

OCR needs to be explored more. LLMs aided by multiple OCR sources have so far turned out to be an easy way to trade cost against quality: southbridge-research.notion.site/OCR-with-GOT-and-Sonnet-1185fec70db180b2b4c7f0a59e053f98 x.com/tarekayed00/status/1846241983284576348

The best results I get for noise removal are from Audacity: support.audacityteam.org/repairing-audio/noise-reduction-removal There are also smaller packages that do a good job of automating this, but nothing has so far beaten Audacity. x.com/ben_rapaport/status/1846242113861628403

Unfortunately, open-source multimodal models aren't there just yet. You could possibly boost the quality by using multiple sources like whisper, diart, etc. and then pulling them together with a good model like Gemma (which lets you carry over some prompting) - I haven't tried yet. x.com/rishav_sapahia/status/1846245656828010893

Overlapping speakers are tough - especially for me because I tend to ADD interrupt a lot 😅 Whisper will usually pick up one part of the full sentence, so maybe there's something there to merge with the diarized output? x.com/zhanghaoxxxx/status/1846098420567810075

This is awesome! Yeah, doing sequential passes while keeping previous context is a big boost to quality, at the tradeoff of cost, which keeps coming down. x.com/meekaale/status/1846259389130256838
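
The sequential-pass idea can be sketched roughly like this. `transcribe_chunk` is a hypothetical callable standing in for whatever model call you use, and the 2000-character context window is an arbitrary assumption:

```python
from typing import Callable, Iterable


def transcribe_sequentially(
    chunks: Iterable,
    transcribe_chunk: Callable[[object, str], str],
    context_chars: int = 2000,
) -> list[str]:
    """Transcribe chunks in order, passing the tail of the transcript so far as context.

    transcribe_chunk(chunk, context) is supplied by the caller and returns text.
    Only the last context_chars characters are carried forward to limit cost.
    """
    parts: list[str] = []
    context = ""
    for chunk in chunks:
        text = transcribe_chunk(chunk, context)
        parts.append(text)
        context = (context + " " + text)[-context_chars:].strip()
    return parts
```

Each call costs more because of the extra context tokens, which is the cost-for-quality trade mentioned above.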

Without other references I think it finds it hard to tell. I did once try giving it a 10 second clip of clean audio and it reasonably pointed out problems with my audio. Wasn't sure then what I would use it for! x.com/randallb/status/1846260048856592484