Techniques comparable to reinforcement learning from human feedback (RLHF) have emerged. Recently, researchers from Google Research proposed reinforcement learning from AI feedback (RLAIF), a technique that can match human-level performance and offers a potential solution to the scalability limitations of RLHF. The related paper, titled “RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback”, has been published on the preprint site arXiv.

RLHF: can lead to inaccurate or harmful behavior

RLHF is a method for fine-tuning pre-trained large language models (LLMs) under human guidance. It consists of three interrelated processes: feedback collection, reward modeling, and policy optimization. Feedback collection gathers human evaluations of LLM outputs. This feedback data is then used to train a reward model through supervised learning; the reward model is designed to approximate human preferences. Policy optimization then uses a reinforcement learning loop to optimize the LLM to produce outputs that the reward model scores favorably. These steps can be performed iteratively or simultaneously.

The key advantages of RLHF over traditional RL methods are better alignment with human intentions, planning conditioned on future feedback, flexible learning from various types of feedback, and the ability to collect feedback as needed, all of which are essential for creating truly intelligent agents. In addition, RLHF allows machines to learn by abstracting human values rather than simply imitating human behavior, making agents more adaptable, more interpretable, and more reliable in decision making. RLHF is already widely used in fields such as business, education, healthcare, and entertainment, and underlies systems including OpenAI's ChatGPT, DeepMind's Sparrow, and Anthropic's Claude.

However, AI models trained with RLHF can still behave inaccurately or harmfully. Moreover, collecting human preference data as feedback is costly, and disagreements between human annotators introduce variance into the training data, which can cause confusion in situations where the ground truth is ambiguous (e.g., moral dilemmas). In addition, human feedback in RLHF is often restricted to preference rankings, which carry limited information and restrict its applicability.

RLAIF vs. RLHF

In this work, RLAIF showed the potential to address these problems with RLHF. The researchers used an off-the-shelf LLM to annotate preferences between candidate pairs; the model was pre-trained or instruction-tuned for general purposes but not fine-tuned for the specific downstream task. Given a text and two candidate summaries, the LLM is asked to evaluate which summary is better. The input is structured as follows:

1. Preamble – instructions that introduce and describe the task at hand;
2. Few-shot exemplars – an example text, a pair of summaries, a chain-of-thought (CoT) rationale, and a preference judgment;
3. Sample to annotate – a text and a pair of summaries to be labeled;
4. Ending – an ending string that prompts the LLM for its answer.

After the LLM receives this input, the researchers obtain the log-probabilities of it generating the tokens “1” and “2” and compute a softmax over them to obtain a preference distribution (a code sketch of this computation appears below).

They experimented with two types of preambles. The “Base” preamble simply asks “Which summary is better?”, while the “OpenAI” preamble mimics the rating instructions given to the human preference annotators in the OpenAI TL;DR project and contains detailed guidance on what makes a strong summary. They also ran in-context learning experiments that provide additional context by adding a few manually selected exemplars covering different topics.

After the LLM labeled the preferences, the researchers trained a reward model (RM) to predict them (a sketch of such reward-model training also appears below). Three metrics were then used for evaluation: AI labeler alignment, pairwise accuracy, and win rate.
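To make the preference-labeling step concrete, here is a minimal Python sketch of how a soft preference distribution can be computed from the labeler LLM's log-probabilities for the answer tokens “1” and “2”. The example log-probabilities are made up, and the sketch is illustrative rather than the authors' code.

```python
import math

def preference_distribution(logprob_1: float, logprob_2: float):
    """Softmax over the labeler LLM's log-probabilities for the answer
    tokens "1" and "2", giving a soft preference distribution."""
    m = max(logprob_1, logprob_2)          # subtract the max for numerical stability
    e1 = math.exp(logprob_1 - m)
    e2 = math.exp(logprob_2 - m)
    return e1 / (e1 + e2), e2 / (e1 + e2)  # P(summary 1 better), P(summary 2 better)

# Example with made-up log-probabilities returned by a labeler model:
p1, p2 = preference_distribution(-0.3, -1.4)
print(f"P(summary 1 preferred) = {p1:.2f}, P(summary 2 preferred) = {p2:.2f}")
```

The resulting soft probabilities are the preference distribution referred to above and can serve as training targets for the reward model.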
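The reward-modeling step can also be made concrete with a short sketch. The PyTorch example below trains a toy reward model on AI preference labels with a Bradley-Terry style pairwise loss; the embedding-based architecture, the loss choice, and the random tensors are assumptions made for illustration, not details taken from the paper (a real RM would typically be a language model with a scalar value head).

```python
import torch
import torch.nn as nn

class TinyRewardModel(nn.Module):
    """Toy reward model: maps a fixed-size text embedding to a scalar reward.
    (Stand-in for a full language model with a scalar head.)"""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.scorer(emb).squeeze(-1)   # one scalar reward per example

rm = TinyRewardModel()
opt = torch.optim.Adam(rm.parameters(), lr=1e-4)

# Fake batch: embeddings of summary 1 and summary 2 for the same texts,
# plus the AI labeler's soft preference for summary 1 (e.g., from a softmax over token log-probs).
emb1, emb2 = torch.randn(8, 128), torch.randn(8, 128)
p_prefer_1 = torch.rand(8)

# Pairwise objective: the reward gap r1 - r2, passed through a sigmoid,
# should match the labeler's preference probability for summary 1.
r1, r2 = rm(emb1), rm(emb2)
loss = nn.functional.binary_cross_entropy_with_logits(r1 - r2, p_prefer_1)
opt.zero_grad()
loss.backward()
opt.step()
print(f"pairwise preference loss: {loss.item():.3f}")
```

In an RLAIF pipeline, the trained reward model then scores policy outputs during the reinforcement learning stage, just as in RLHF, but without any human-labeled preferences.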
Experimental results show that RLAIF can be a viable alternative to RLHF that does not rely on human annotators. In human evaluation, RLAIF was preferred over a baseline supervised fine-tuned model 71% of the time, while RLHF was preferred over the same baseline 73% of the time. The study also compared the win rates of RLAIF and RLHF against each other in terms of human preference, and the results showed that the two were equally preferred by human evaluators. In addition, the researchers compared RLAIF and RLHF summaries with human-written reference summaries: the RLAIF summaries were preferred over the reference summaries in 79% of cases, and the RLHF summaries in 80% of cases.

However, while this work highlights the potential of RLAIF, it has some limitations. First, the study focused only on the summarization task, so its generalization to other tasks is unclear. Second, the study did not fully evaluate the cost-effectiveness of LLM inference compared with manual annotation. There are also many interesting open questions, such as whether combining RLHF with RLAIF can surpass either method alone, how effective it is to use an LLM to assign rewards directly, whether improving AI labeler alignment translates into a better final policy, and whether an LLM labeler of the same size as the policy model can further improve the policy. Nevertheless, this study lays a solid foundation for deeper research on RLAIF, and we look forward to more outstanding results in this field.

Reference links:
https://arxiv.org/abs/2309.00267
https://bdtechtalks.com/2023/09/04/rlhf-limitations/

Author: Yan Yimi
Editor: Academic