Wrong twice this week!
I've been suggesting self-consistency as a way to trade extra compute for accuracy. Turns out it doesn't work as well as I thought, and there are better ways.
Way too many useful things in this frankly underrated paper I wish I read sooner👇
The first is about self-consistency, which is just sampling the model several times at high temperature and taking the majority answer across the results.
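If that's fuzzy, here's a minimal sketch of what I mean by self-consistency - `sample_completion` and `extract_answer` are hypothetical stand-ins for whatever model call and answer parser you already have:

```python
# Minimal sketch of self-consistency: sample several reasoning paths at
# temperature > 0, pull out the final answer from each, and majority-vote.
from collections import Counter

def self_consistency(prompt, sample_completion, extract_answer,
                     n_samples=10, temperature=0.7):
    answers = []
    for _ in range(n_samples):
        path = sample_completion(prompt, temperature=temperature)  # one sampled CoT path
        answers.append(extract_answer(path))                       # keep only the final answer
    # The most common answer across the sampled paths wins the vote.
    return Counter(answers).most_common(1)[0][0]
```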
The argument the paper makes against it takes a couple of steps, but here it is:
1. Decoding paths that contain chain-of-thought reasoning are more likely to reach the correct answer.
2. But the answers from those CoT paths often aren't the predominant answer across sampled paths - so a majority vote tends to settle on an answer that's more likely to be wrong.
Live and learn I guess - but what excites me about this is the next thing.
I'd always imagined LLMs as needing CoT prompts to elicit reasoning, or needing to be finetuned to respond with CoT tokens.
Turns out that's not always true. Even branching at just the first token, CoT paths usually already exist - especially at larger model sizes. However,
they usually don't show up if you always pick the most probable token (greedy decoding). My longstanding guess for why: users prefer LLMs (and humans) that answer immediately without prevaricating, so the highest-probability path tends to jump straight to an answer.
I've caught myself doing the same - answering quickly instead of thinking the problem through.
The improvements (as much as benchmarks are hard to believe these days) make a lot of sense.
What's also amazing is that when a decoding path contains CoT, the model is measurably more confident in its answer, right there in the token probabilities. So how does the method work?
Simple - at the very first token, they branch into the top-k candidate tokens (in decreasing probability) and decode each path greedily. They extract the answer from each path, score it by how confidently the answer tokens were chosen, then pick the answer with the highest total confidence.
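Here's a rough sketch of how I read the mechanism. The helpers (`first_token_topk`, `greedy_continue`, `extract_answer`, `answer_span_top2_probs`) are hypothetical placeholders, not the paper's code - you'd build them on top of your model's logits:

```python
# Sketch of CoT-decoding as I understand it: branch only at the first token,
# decode greedily from each branch, and score each path by how decisively
# the model picks its answer tokens.
from collections import defaultdict

def cot_decode(prompt, first_token_topk, greedy_continue,
               extract_answer, answer_span_top2_probs, k=10):
    answer_scores = defaultdict(float)
    for first_token in first_token_topk(prompt, k):        # branch only at step 1
        path = greedy_continue(prompt, first_token)         # then decode greedily
        # Confidence: average gap between the top-1 and top-2 token
        # probabilities over the tokens that form the final answer.
        gaps = [p1 - p2 for p1, p2 in answer_span_top2_probs(path)]
        confidence = sum(gaps) / max(len(gaps), 1)
        answer_scores[extract_answer(path)] += confidence   # sum confidence per answer
    # The answer with the highest total confidence across branches wins.
    return max(answer_scores, key=answer_scores.get)
```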
This is also a good way to test intrinsic reasoning capabilities (since the prompting is just Q: A:) across models and tasks.
The performance improvement also sticks around, even when you add chain-of-thought prompting 🤯
There are a few takeaways here for sampling-based approaches. The first is that branching early in token selection (for diversity) works significantly better than branching later on. This is different from Entropix, which (in my limited understanding) doesn't vary the sampling temperature much with position in the sequence.
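To make the "branch early" point concrete, here's a toy way to express it as a branch-point parameter - again the helpers (`greedy_prefix`, `topk_next_tokens`, `greedy_continue`) are hypothetical, and `branch_step=0` recovers the first-token branching above:

```python
# Toy illustration: the same top-k split can be applied at any decoding step.
# The finding is that splitting at step 0 gives far more diverse (and more
# often correct) paths than splitting later in the sequence.
def branched_paths(prompt, greedy_prefix, topk_next_tokens,
                   greedy_continue, branch_step=0, k=10):
    prefix = greedy_prefix(prompt, steps=branch_step)    # greedy tokens up to the branch point
    paths = []
    for tok in topk_next_tokens(prompt, prefix, k):      # split into k alternatives here
        paths.append(greedy_continue(prompt, prefix + [tok]))
    return paths
```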
The intrinsic CoT paths also reveal things about models in their base state.
For one, models tend to do math left to right instead of following the correct order of operations (e.g. reading 3 + 5 × 2 as 16 rather than 13).
They find paths harder as the number of steps grows or the task gets more complex. State tracking becomes especially hard.
They also suggest that CoT prompting often just causes the model to 'mimic' the examples in the prompt - which can be good or bad depending on what you're trying to do.