[Chart: math performance of various AI models]
AI Developer Experience at Google DeepMind 🔵 prev: Tech Lead at Hugging Face, AWS ML Hero 🤗 Sharing my own views and AI News
Are LLMs really good at Math? A new paper reveals that LLMs perform well on individual math problems but struggle with chained problems, where the answer to one informs the next. This reasoning gap is larger in smaller, specialized models. 👀

The reasoning gap is the difference between an LLM's expected performance (based on individual question accuracy) and its actual performance on chained reasoning tasks.

Tested models include the Google DeepMind Gemini series (1.5 Pro, 1.0 Pro, 1.5 Flash), OpenAI GPT-4o and GPT-4o mini, Meta Llama 3 (70B and 8B), specialized math models (Mathstral-7B, NuminaMath-7B-CoT), Mistral AI, Microsoft Phi, and more.

1️⃣ Create pairs of grade-school math problems where the answer to the first (Q1) is needed to solve the second (Q2): the Compositional GSM dataset.
2️⃣ Evaluate LLMs on both the individual problems (Q1 and Q2 separately) and the combined pairs.
3️⃣ Compare the actual accuracy on the combined pairs with the expected accuracy (accuracy of Q1 × accuracy of Q2) ⇒ reasoning gap.

Insights 💡
💡 LLMs struggle with multi-hop reasoning, leading to a "reasoning gap" on chained math problems.
🤔 The reasoning gap might come from distraction by the extra context (indicating missing training data?).
📈 Larger LLMs generally perform better than smaller, specialized models.
📚 Fine-tuning on grade-school math can lead to overfitting, hindering generalization to chained problems.
💡 Instruction tuning and code generation yield different improvements depending on model size.
📊 High scores on standard benchmarks don't reflect true reasoning ability on multi-step problems.
❌ OpenAI o1-preview and o1-mini were not tested (they probably weren't released at the time).

Paper: https://lnkd.in/eaNcPdrS
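The comparison in step 3️⃣ can be sketched in a few lines of Python. This is a minimal illustration of the metric as described above, with made-up accuracy numbers, not the paper's actual code or results:

```python
def reasoning_gap(acc_q1: float, acc_q2: float, acc_chained: float) -> float:
    """Expected accuracy on a chained pair is the product of the individual
    accuracies (treating Q1 and Q2 as independent); the reasoning gap is how
    far the measured chained accuracy falls below that expectation."""
    expected = acc_q1 * acc_q2
    return expected - acc_chained

# Hypothetical model: solves 90% of Q1s and 85% of Q2s in isolation,
# but only 60% of the chained pairs.
gap = reasoning_gap(0.90, 0.85, 0.60)
print(round(gap, 3))  # expected 0.765 vs. measured 0.60 -> gap of 0.165
```

A positive gap means the model does worse on chained problems than its individual-question accuracy would predict, which is exactly the pattern the paper reports for smaller, math-specialized models.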