Is Gemini Pro better than GPT-4? What can we learn from a multi-dataset comparison?
First, what does it cost?
Put it this way: with "Language Models are Few-Shot Learners" as the input and the Gettysburg Address as the output, here's what it would cost compared to the 16K version of GPT-3.5.
Input: Gemini Pro ($0.059), GPT-3.5 ($0.064)
Output: Gemini Pro ($0.000906), GPT-3.5 ($0.00105)
Ouch.
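For reference, the arithmetic behind numbers like these is just per-1K-token (or per-character) rates applied to input and output separately. A minimal sketch in Python, with placeholder rates rather than the actual prices behind the figures above:

```python
# Sketch of the cost arithmetic only. The rates below are placeholders, not the
# actual Gemini Pro / GPT-3.5 prices used for the figures above.

def api_cost(input_tokens: int, output_tokens: int,
             input_rate_per_1k: float, output_rate_per_1k: float) -> tuple[float, float]:
    """Return (input_cost, output_cost) in dollars for one call."""
    return (input_tokens / 1000 * input_rate_per_1k,
            output_tokens / 1000 * output_rate_per_1k)

# Hypothetical volumes: a long paper as input, a short speech as output.
in_cost, out_cost = api_cost(20_000, 300,
                             input_rate_per_1k=0.003,   # placeholder
                             output_rate_per_1k=0.004)  # placeholder
print(f"input: ${in_cost:.4f}, output: ${out_cost:.4f}")
```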
But there's a lot to be learned here. The paper goes into great detail (with data) about performance across multiple domains.
Here are my takeaways:
1. The evidence suggests to me that Gemini is simply a very different model. GPT-3.5, GPT-4, and even Mixtral often show the same behavior patterns across output length, task type, etc., while Gemini just looks different.
This could be architecture; it could also be pretraining.
It could also just be prompting - most of the existing prompt literature (as well as quite a few benchmarks) is tuned for OpenAI's GPT models, simply because they were among the first and cheapest to exist.
I've observed differences with Claude this way.
Unfortunately, without a more compelling model to justify the effort (or doing this work yourself), it's hard to figure out what the right way to prompt Gemini is compared to GPT-3.5.
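If you do want to run that comparison yourself, the scaffolding isn't complicated: same questions, a few prompt templates, accuracy per (model, template) cell. A rough sketch, where `call_model`, `grade`, and the templates are all stand-ins for whatever you actually use:

```python
from itertools import product

# Hypothetical prompt templates; swap in whatever styles you want to compare.
TEMPLATES = {
    "plain": "{question}",
    "cot": "{question}\nLet's think step by step.",
    "persona": "You are a careful domain expert.\n{question}",
}

def compare_prompts(questions, models, call_model, grade):
    """call_model(model, prompt) -> str; grade(response, question) -> bool."""
    scores = {}
    for model, (name, template) in product(models, TEMPLATES.items()):
        correct = sum(
            grade(call_model(model, template.format(question=q["text"])), q)
            for q in questions
        )
        scores[(model, name)] = correct / len(questions)
    return scores  # e.g. {("gemini-pro", "cot"): 0.61, ...}
```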
So is it worth using? Let's check the data.
Gemini is pretty good at translation (on the languages it supports), but on most other things you're probably right to presume it's worse than or equal to GPT-3.5.
It's pretty egregiously bad at agent work - which is really perplexing.
In an age where we suspect OpenAI is finetuning and retraining its models to almost always go through a Chain of Thought, Gemini seems to want to skip reasoning even when prompted. It also marks a strangely high number of tasks as unachievable, perhaps due to lopsided training?
What's wonderful about this paper is that they provide proper pages with all the data and interactive graphs for the evaluation. I'll link each.
First, MMLU
Weird: on MCQs, Gemini is biased toward selecting option D - likely a lack of tuning specifically for multiple-choice formats.
hub.zenoml.com/report/2674/Gemini%20MMLU
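If you want to check for that kind of answer-letter bias in your own eval runs, it's a quick pass over the extracted predictions. A small sketch (the `predictions` list is made up):

```python
from collections import Counter

# Hypothetical predicted answer letters extracted from an MCQ eval run.
predictions = ["D", "B", "D", "A", "D", "C", "D", "D", "B", "D"]

counts = Counter(predictions)
total = sum(counts.values())
for letter in "ABCD":
    # A heavy skew toward one letter (relative to the gold distribution) hints at position bias.
    print(f"{letter}: {counts.get(letter, 0) / total:.0%}")
```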
This is really interesting. Not sure if it's finetuning, but the data suggests GPT-4 goes into CoT unprompted.
Also interesting that Gemini underperforms GPT-3.5 on most tasks, except a few like College & High School Biology, Macroeconomics, and Security Studies. GPT-4 still wins overall.
💡Something new to me: they use the length of the Chain-of-Thought segment as a proxy for reasoning complexity. It seems all models lose accuracy as this increases. GPT-4 wins, but Gemini degrades the least.
They label this as "Gemini handles more complex reasoning chains"
Not fully convinced. Models can be verbose in reasoning for other reasons, as suggested by the difference in output length distribution.
Nevertheless it's a useful finding overall & a quick and dirty way to measure task complexity in CoT situations. Definitely using that!
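Here's roughly what that quick-and-dirty measurement looks like in practice: bucket examples by the length of the model's CoT and compute accuracy per bucket. The field names here are hypothetical, and the same bucketing works just as well on input length:

```python
from statistics import mean

def accuracy_by_cot_length(examples, bucket_size=50):
    """examples: dicts with 'cot' (reasoning text) and 'correct' (bool). Field names are assumed."""
    buckets = {}
    for ex in examples:
        cot_words = len(ex["cot"].split())               # crude word count as a length proxy
        bucket = (cot_words // bucket_size) * bucket_size
        buckets.setdefault(bucket, []).append(ex["correct"])
    # accuracy per CoT-length bucket, e.g. {0: 0.84, 50: 0.79, 100: 0.71, ...}
    return {b: mean(vals) for b, vals in sorted(buckets.items())}
```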
Reasoning: BIG-Bench Hard
Once again length is used as a proxy for complexity - this time question length. Not sure how accurate that is, but the results are useful even if we just read them as a function of input length (question verbosity + complexity + data).
hub.zenoml.com/report/2575/Gemini%20BBH
Gemini's accuracy degraded heavily on longer questions, while the others' did not. Mixtral and GPT-4 are notable for going up around the middle of the range, which might be a result of MoE.
We also have our first instance of Gemini beating GPT-4: Sports understanding.
Not sure where that's useful but good to know I guess
Another reason to take benchmark results with a big grain of salt: evaluating responses is still pretty hard to do. Models will often give the right (or wrong) answer without respecting the requested format. Proper LLM-based evals (finetuned judge models) might be the only solution here.
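To make the format problem concrete: a strict exact-match check throws away correct answers wrapped in prose, and even a lenient regex extractor (the crude middle ground before LLM judges) only partly fixes it. A toy sketch:

```python
import re

def strict_match(response: str, gold: str) -> bool:
    # Exact string match: fails whenever the model adds any surrounding prose.
    return response.strip() == gold.strip()

def lenient_match(response: str, gold: str) -> bool:
    # Pull the last number or standalone A-D option out of the response, then compare.
    candidates = re.findall(r"[-+]?\d*\.?\d+|\b[A-D]\b", response)
    return bool(candidates) and candidates[-1] == gold.strip()

print(strict_match("The answer is 42.", "42"))   # False - right answer, wrong format
print(lenient_match("The answer is 42.", "42"))  # True
```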
Next is code: hub.zenoml.com/report/2641/Gemini%20Code
Same story as before: mostly worse than or equal to GPT-3.5. However, Gemini gets almost linearly worse as output length grows, and the earlier win on I/O length doesn't hold here.
Man I'm still amazed that Mixtral actually punches in this weight class
Next is Machine Translation. Gemini is good when it works, but it's also likely being held back by outright blocking of certain languages.
Perhaps blocking low-performing languages was a way of propping up the reported numbers?
Agents: hub.zenoml.com/report/2608/Gemini%20Webarena
Gemini labels tasks as unachievable far more often than the GPTs. It also gives shorter responses and terminates in far fewer steps than GPT or even Mixtral, and it tends to skip reasoning even when asked to 'think step-by-step'.
This is really weird to me. As noted above, while OpenAI seems to be finetuning its models to almost always go through a Chain of Thought, Gemini wants to skip that part even when prompted, and often just refuses to try to solve the problem at all.
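If you're digging through agent traces yourself, the "gives up early" pattern is easy to quantify. A rough sketch - the trace fields and the "unachievable" marker are assumptions about how your logs might look, not the paper's exact setup:

```python
def refusal_stats(traces):
    """traces: dicts with 'steps' (list of actions) and 'final' (final message). Fields are assumed."""
    n = max(len(traces), 1)
    gave_up = sum(1 for t in traces if "unachievable" in t["final"].lower())
    avg_steps = sum(len(t["steps"]) for t in traces) / n
    return {"unachievable_rate": gave_up / n, "avg_steps": avg_steps}
```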
Agree with most of the conclusions, but from looking at the actual questions and answers in the dataset, I'm not sure I agree with #3: I think length might be a poor proxy for complexity, in both the input and the output.
Hope that was helpful! Clearing out my bookmarks folder for the year has been a journey; I'm about 2% through. I'll post notes on Twitter as I go through them.
Here's the full paper, it's well worth the read.
arxiv.org/abs/2312.11444