Large language models (LLMs) have demonstrated excellent capabilities on a wide range of downstream NLP tasks. Pioneering models such as GPT-4 and ChatGPT, trained on massive amounts of text, show strong text understanding and generation abilities, produce coherent and contextually relevant responses, and are highly versatile across NLP tasks. However, their performance in mathematical reasoning remains unsatisfactory: LLMs struggle to perform complex arithmetic accurately, especially multiplication of numbers with more than 8 digits and operations involving decimals and fractions.

Motivated by this, researchers from Tsinghua University, TAL AI Lab, and Zhipu AI jointly proposed a new model, MathGLM, aimed at performing complex arithmetic operations accurately. The study shows that, with enough training data, a 2-billion-parameter language model can carry out multi-digit arithmetic with an accuracy of almost 100% and without data leakage, far exceeding GPT-4, whose multi-digit multiplication accuracy is only 4.3%.

Method Introduction

The paper proposes MathGLM to explore how effective LLMs can be at mathematical reasoning. The arithmetic tasks MathGLM needs to handle fall roughly into two categories: basic arithmetic operations and complex mixed operations. Basic arithmetic operations cover simple calculations over two numbers, while complex mixed operations combine different arithmetic operations and number formats (integers, decimals, fractions, and so on). Table 1 shows the classification of MathGLM tasks.

To strengthen MathGLM's arithmetic capabilities, the authors adopt a Transformer-based decoder-only architecture and train it from scratch on a generated arithmetic dataset using an autoregressive objective.

Learning arithmetic tasks

The arithmetic training dataset is carefully designed to cover addition, subtraction, multiplication, division, and exponentiation, as well as multiple number formats, including integers, decimals, percentages, fractions, and negative numbers. The datasets vary in size from 1 million to 50 million records. In each dataset, a single arithmetic expression consists of 2 to 10 steps, mixing addition (+), subtraction (-), multiplication (×), division (/), and exponentiation (^). Figure 3 shows some training examples drawn from the arithmetic dataset; a sketch of how such expressions could be generated follows below.

Table 2 summarizes the different MathGLM model sizes: four variants, the largest with 2B parameters and the highest capacity, followed by a 500M-, a 100M-, and a smallest 10M-parameter model.
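The excerpt above does not reproduce Figure 3's examples or the authors' data-generation pipeline, so the following is a minimal Python sketch of one way such multi-step expressions could be produced and paired with computed answers. The operand ranges, the step counts, and the `random_expression` / `make_example` helpers are illustrative assumptions, not the paper's actual code.

```python
# Minimal sketch (not the authors' released code): generate multi-step
# arithmetic expressions in the spirit of Figure 3, chaining 2-10 operations
# over integers, decimals, and fractions, then evaluate them to form
# "expression = answer" training pairs. The paper's examples display
# multiplication, division, and powers as x, /, ^; here Python's operators
# are used so the strings can be evaluated directly.
import random

OPS = ["+", "-", "*", "/", "**"]

def random_operand():
    """Pick an integer, decimal, or fraction operand at random."""
    kind = random.choice(["int", "dec", "frac"])
    if kind == "int":
        return str(random.randint(-1000, 1000))
    if kind == "dec":
        return str(round(random.uniform(-100, 100), 2))
    return f"({random.randint(1, 20)}/{random.randint(1, 20)})"

def random_expression(max_steps=10):
    """Chain 2 to max_steps binary operations into one expression string."""
    steps = random.randint(2, max_steps)
    expr = random_operand()
    for _ in range(steps):
        op = random.choice(OPS)
        rhs = random_operand()
        if op == "**":                      # keep exponents small
            rhs = str(random.randint(2, 3))
        expr = f"({expr}) {op} ({rhs})"
    return expr

def make_example():
    """Return one 'expression = answer' line, or None if the draw is invalid."""
    expr = random_expression()
    try:
        answer = eval(expr)                 # safe here: input is self-generated
    except (ZeroDivisionError, OverflowError):
        return None
    return f"{expr} = {round(answer, 6)}"

if __name__ == "__main__":
    samples = [s for s in (make_example() for _ in range(20)) if s]
    print("\n".join(samples[:5]))
```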
Solving math word problems

In addition to arithmetic tasks, the paper also fine-tunes a series of Transformer-based language models, the General Language Model (GLM) and its chat versions, to solve math word problems. Training uses the public Chinese Ape210K dataset, which contains 210,000 Chinese elementary-school math problems; in the original dataset, only the final answer to each problem is calculated directly.

To improve MathGLM's performance on math word problems, the paper adopts a step-by-step strategy and reconstructs the Ape210K dataset into a version in which the answer to each problem is calculated step by step. Figure 4 compares the original Ape210K dataset with the reconstructed version (a minimal sketch of such a step-by-step expansion appears at the end of this article). The paper uses different GLM variants as the backbone to train MathGLM, including GLM-Large (335M parameters), GLM-6B, GLM2-6B, and GLM-10B, as well as ChatGLM-6B and ChatGLM2-6B. These backbones give MathGLM basic language understanding, enabling it to effectively parse the language information contained in math word problems.

Experiments

The paper designs two types of experiments, covering arithmetic tasks and math word problems.

For arithmetic tasks, the authors pre-trained a Transformer-based MathGLM with 500M parameters and compared it with leading LLMs such as GPT-4 and ChatGPT; the results are shown in Table 3. MathGLM outperforms all the other models, indicating excellent performance on arithmetic tasks. Even the 10M-parameter variant is striking: MathGLM-10M outperforms GPT-4 and ChatGPT on a range of comprehensive arithmetic tasks. Moreover, comparing MathGLM models of different sizes, the authors observed that arithmetic performance grows with the number of parameters, suggesting that performance scales with model size.

In summary, the evaluation on complex arithmetic tasks shows that MathGLM performs very well; by decomposing arithmetic tasks into steps, these models significantly outperform GPT-4 and ChatGPT. The paper also compares against GPT-4, ChatGPT, text-davinci-003, code-davinci-002, Galactica, LLaMA, OPT, BLOOM, and GLM on a compact test set of 100 cases randomly sampled from the larger arithmetic dataset discussed earlier; the results are shown in Table 4. MathGLM with 2 billion parameters achieves an accuracy of 93.03%, surpassing all the other LLMs.

For math word problems, the paper runs experiments on the Ape210K dataset. Table 8 reports results for the MathGLM variants, GPT-4, ChatGPT, and others. With GLM-10B as the backbone, MathGLM reaches answer accuracy comparable to the state-of-the-art GPT-4. Furthermore, comparing MathGLM built on GLM-Large, GLM-6B, and GLM-10B reveals a clear trend: MathGLM improves markedly in both arithmetic accuracy and answer accuracy as the backbone grows.

To evaluate the models' ability to solve math problems across elementary-school grade levels, the study also tested GPT-4, ChatGPT, Chinese-Alpaca-13B, MOSS-16B, Ziya-LLaMA-13B, Baichuan-7B, ChatGLM-6B, ChatGLM2-6B, and MathGLM-GLM-10B on the K6 dataset; the results are shown in Figure 8.

Paper address: https://arxiv.org/pdf/2309.03241v2.pdf
Project address: https://github.com/THUDM/MathGLM#arithmetic-tasks
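As a closing illustration of the step-by-step reconstruction described above, here is a minimal sketch that expands a single answer expression into explicit intermediate calculation steps, innermost operation first. The `expand_steps` helper, the Python-syntax input (with `**` for powers), and the printed step format are assumptions for illustration; the paper's actual Ape210K preprocessing may differ.

```python
# Minimal sketch (assumption, not the authors' preprocessing code): rewrite a
# math-word-problem answer expression into explicit intermediate steps, in the
# spirit of the reconstructed Ape210K format shown in Figure 4.
import ast
import operator

# Map AST operator nodes to Python functions and to display symbols
# (the input uses Python syntax, so powers are written as **; they are
# printed with ^ to match the article's notation).
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv,
       ast.Pow: operator.pow}
SYMS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*",
        ast.Div: "/", ast.Pow: "^"}

def expand_steps(expression: str):
    """Evaluate `expression` innermost-first, recording each binary step."""
    steps = []

    def walk(node):
        if isinstance(node, ast.Constant):          # a bare number
            return node.value
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)              # negative literal
        if isinstance(node, ast.BinOp):
            left, right = walk(node.left), walk(node.right)
            result = OPS[type(node.op)](left, right)
            steps.append(f"{left} {SYMS[type(node.op)]} {right} = {result}")
            return result
        raise ValueError(f"unsupported syntax: {ast.dump(node)}")

    tree = ast.parse(expression, mode="eval")
    final = walk(tree.body)
    return steps, final

if __name__ == "__main__":
    # e.g. an Ape210K-style answer expression
    steps, answer = expand_steps("(10 - 2) * 3 + 4 / 2")
    for s in steps:
        print(s)
    print("answer:", answer)
```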