Large language models (LLMs) have demonstrated excellent capabilities on a wide range of downstream NLP tasks. Pioneering models such as GPT-4 and ChatGPT, trained on massive amounts of text, show strong text understanding and generation abilities, produce coherent and contextually relevant responses, and are highly versatile across NLP tasks. However, their performance in mathematical reasoning remains unsatisfactory: LLMs struggle to perform complex arithmetic accurately, especially multiplication of numbers with more than 8 digits and operations involving decimals and fractions.

Motivated by this, researchers from Tsinghua University, TAL AI Lab, and Zhipu AI jointly proposed a new model, MathGLM, aimed at performing complex arithmetic operations accurately. The study shows that, with enough training data, a 2-billion-parameter language model can carry out multi-digit arithmetic with an accuracy of almost 100% and without data leakage, far exceeding GPT-4, whose multi-digit multiplication accuracy is only 4.3%.

Method Introduction

The paper proposes MathGLM to explore how effective LLMs can be at mathematical reasoning. The arithmetic tasks MathGLM needs to handle fall roughly into two categories: basic arithmetic operations and complex mixed operations. Basic arithmetic operations cover simple calculations over two numbers, while complex mixed operations combine different arithmetic operations and number formats (integers, decimals, fractions, and so on). Table 1 shows the classification of MathGLM tasks.

To strengthen MathGLM's arithmetic capabilities, the authors adopt a Transformer-based decoder-only architecture and train it from scratch on a generated arithmetic dataset using an autoregressive objective.

Learning arithmetic tasks

The arithmetic training dataset is carefully designed to cover addition, subtraction, multiplication, division, and exponentiation, as well as multiple number formats, including integers, decimals, percentages, fractions, and negative numbers. The datasets vary in size from 1 million to 50 million records. In each dataset, a single arithmetic expression consists of 2 to 10 steps, mixing addition (+), subtraction (-), multiplication (×), division (/), and exponentiation (^). Figure 3 shows some training examples drawn from the arithmetic dataset; a sketch of how such expressions could be generated follows below.

Table 2 summarizes the different MathGLM model sizes: four variants, the largest with 2B parameters and the highest capacity, followed by a 500M-, a 100M-, and a smallest 10M-parameter model.
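The excerpt above does not reproduce Figure 3's examples or the authors' data-generation pipeline, so the following is a minimal Python sketch of one way such multi-step expressions could be produced and paired with computed answers. The operand ranges, the step counts, and the `random_expression` / `make_example` helpers are illustrative assumptions, not the paper's actual code.

```python
# Minimal sketch (not the authors' released code): generate multi-step
# arithmetic expressions in the spirit of Figure 3, chaining 2-10 operations
# over integers, decimals, and fractions, then evaluate them to form
# "expression = answer" training pairs. The paper's examples display
# multiplication, division, and powers as x, /, ^; here Python's operators
# are used so the strings can be evaluated directly.
import random

OPS = ["+", "-", "*", "/", "**"]

def random_operand():
    """Pick an integer, decimal, or fraction operand at random."""
    kind = random.choice(["int", "dec", "frac"])
    if kind == "int":
        return str(random.randint(-1000, 1000))
    if kind == "dec":
        return str(round(random.uniform(-100, 100), 2))
    return f"({random.randint(1, 20)}/{random.randint(1, 20)})"

def random_expression(max_steps=10):
    """Chain 2 to max_steps binary operations into one expression string."""
    steps = random.randint(2, max_steps)
    expr = random_operand()
    for _ in range(steps):
        op = random.choice(OPS)
        rhs = random_operand()
        if op == "**":                      # keep exponents small
            rhs = str(random.randint(2, 3))
        expr = f"({expr}) {op} ({rhs})"
    return expr

def make_example():
    """Return one 'expression = answer' line, or None if the draw is invalid."""
    expr = random_expression()
    try:
        answer = eval(expr)                 # safe here: input is self-generated
    except (ZeroDivisionError, OverflowError):
        return None
    return f"{expr} = {round(answer, 6)}"

if __name__ == "__main__":
    samples = [s for s in (make_example() for _ in range(20)) if s]
    print("\n".join(samples[:5]))
```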
Solving math word problems

In addition to arithmetic tasks, the paper also fine-tunes a series of Transformer-based language models, the General Language Model (GLM) and its chat versions, to solve math word problems. Training uses the public Chinese Ape210K dataset, which contains 210,000 Chinese elementary-school math problems; in the original dataset, only the final answer to each problem is calculated directly.

To improve MathGLM's performance on math word problems, the paper adopts a step-by-step strategy and reconstructs the Ape210K dataset into a version in which the answer to each problem is calculated step by step. Figure 4 compares the original Ape210K dataset with the reconstructed version (a minimal sketch of such a step-by-step expansion appears at the end of this article). The paper uses different GLM variants as the backbone to train MathGLM, including GLM-Large (335M parameters), GLM-6B, GLM2-6B, and GLM-10B, as well as ChatGLM-6B and ChatGLM2-6B. These backbones give MathGLM basic language understanding, enabling it to effectively parse the language information contained in math word problems.

Experiments

The paper designs two types of experiments, covering arithmetic tasks and math word problems.

For arithmetic tasks, the authors pre-trained a Transformer-based MathGLM with 500M parameters and compared it with leading LLMs such as GPT-4 and ChatGPT; the results are shown in Table 3. MathGLM outperforms all the other models, indicating excellent performance on arithmetic tasks. Even the 10M-parameter variant is striking: MathGLM-10M outperforms GPT-4 and ChatGPT on a range of comprehensive arithmetic tasks. Moreover, comparing MathGLM models of different sizes, the authors observed that arithmetic performance grows with the number of parameters, suggesting that performance scales with model size.

In summary, the evaluation on complex arithmetic tasks shows that MathGLM performs very well; by decomposing arithmetic tasks into steps, these models significantly outperform GPT-4 and ChatGPT. The paper also compares against GPT-4, ChatGPT, text-davinci-003, code-davinci-002, Galactica, LLaMA, OPT, BLOOM, and GLM on a compact test set of 100 cases randomly sampled from the larger arithmetic dataset discussed earlier; the results are shown in Table 4. MathGLM with 2 billion parameters achieves an accuracy of 93.03%, surpassing all the other LLMs.

For math word problems, the paper runs experiments on the Ape210K dataset. Table 8 reports results for the MathGLM variants, GPT-4, ChatGPT, and others. With GLM-10B as the backbone, MathGLM reaches answer accuracy comparable to the state-of-the-art GPT-4. Furthermore, comparing MathGLM built on GLM-Large, GLM-6B, and GLM-10B reveals a clear trend: MathGLM improves markedly in both arithmetic accuracy and answer accuracy as the backbone grows.

To evaluate the models' ability to solve math problems across elementary-school grade levels, the study also tested GPT-4, ChatGPT, Chinese-Alpaca-13B, MOSS-16B, Ziya-LLaMA-13B, Baichuan-7B, ChatGLM-6B, ChatGLM2-6B, and MathGLM-GLM-10B on the K6 dataset; the results are shown in Figure 8.

Paper address: https://arxiv.org/pdf/2309.03241v2.pdf
Project address: https://github.com/THUDM/MathGLM#arithmetic-tasks
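As a closing illustration of the step-by-step reconstruction described above, here is a minimal sketch that expands a single answer expression into explicit intermediate calculation steps, innermost operation first. The `expand_steps` helper, the Python-syntax input (with `**` for powers), and the printed step format are assumptions for illustration; the paper's actual Ape210K preprocessing may differ.

```python
# Minimal sketch (assumption, not the authors' preprocessing code): rewrite a
# math-word-problem answer expression into explicit intermediate steps, in the
# spirit of the reconstructed Ape210K format shown in Figure 4.
import ast
import operator

# Map AST operator nodes to Python functions and to display symbols
# (the input uses Python syntax, so powers are written as **; they are
# printed with ^ to match the article's notation).
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv,
       ast.Pow: operator.pow}
SYMS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*",
        ast.Div: "/", ast.Pow: "^"}

def expand_steps(expression: str):
    """Evaluate `expression` innermost-first, recording each binary step."""
    steps = []

    def walk(node):
        if isinstance(node, ast.Constant):          # a bare number
            return node.value
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)              # negative literal
        if isinstance(node, ast.BinOp):
            left, right = walk(node.left), walk(node.right)
            result = OPS[type(node.op)](left, right)
            steps.append(f"{left} {SYMS[type(node.op)]} {right} = {result}")
            return result
        raise ValueError(f"unsupported syntax: {ast.dump(node)}")

    tree = ast.parse(expression, mode="eval")
    final = walk(tree.body)
    return steps, final

if __name__ == "__main__":
    # e.g. an Ape210K-style answer expression
    steps, answer = expand_steps("(10 - 2) * 3 + 4 / 2")
    for s in steps:
        print(s)
    print("answer:", answer)
```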