On the Use of Large Language Models at Solving Math Problems: A Comparison Between GPT-4, LLaMA-2 and Gemini
DOI: https://doi.org/10.9781/ijimai.2025.03.001

Keywords: ChatGPT, Generative AI, Mathematical Problems, Wolfram Mathematica

Abstract
In November 2022, ChatGPT v3.5 was announced to the world. Since then, Generative Artificial Intelligence (GAI) has appeared in the news almost daily, showing impressive capabilities in solving a wide range of tasks that have surprised the research community and the public alike. Indeed, the range of tasks that ChatGPT and other Large Language Models (LLMs) can perform is remarkable, especially when dealing with natural text: text generation, summarisation, translation, and transformation (into poems, songs, or other styles) are among their strengths. However, when it comes to reasoning or mathematical calculations, ChatGPT struggles. In this work, we compare different flavours of ChatGPT (v3.5, v4, and Wolfram GPT) on 20 mathematical tasks drawn from high-school and first-year engineering courses. We show that GPT-4 is far more powerful than ChatGPT-3.5, and that the use of Wolfram GPT can further improve, if only slightly, the results obtained with GPT-4 on these mathematical tasks.