Techniques comparable to reinforcement learning from human feedback (RLHF) have emerged. Recently, researchers from Google Research proposed reinforcement learning from AI feedback (RLAIF), a technique that can match human-level performance and offers a potential solution to the scalability limitations of RLHF. The related paper, titled “RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback”, has been published on the preprint site arXiv.

RLHF: can lead to inaccurate or harmful behavior

RLHF is a method for fine-tuning pre-trained large language models (LLMs) under human guidance. It consists of three interrelated processes: feedback collection, reward modeling, and policy optimization. Feedback collection gathers human evaluations of LLM outputs. This feedback data is then used to train a reward model through supervised learning; the reward model is designed to approximate human preferences. Policy optimization then uses a reinforcement learning loop to optimize the LLM to produce outputs that the reward model scores favorably. These steps can be performed iteratively or simultaneously.

The key advantages of RLHF over traditional RL methods are better alignment with human intentions, planning conditioned on future feedback, flexible learning from various types of feedback, and the ability to collect feedback as needed, all of which are essential for creating truly intelligent agents. In addition, RLHF allows machines to learn by abstracting human values rather than simply imitating human behavior, making agents more adaptable, more interpretable, and more reliable in decision making. RLHF is already widely used in fields such as business, education, healthcare, and entertainment, and underlies systems including OpenAI's ChatGPT, DeepMind's Sparrow, and Anthropic's Claude.

However, AI models trained with RLHF can still behave inaccurately or harmfully. Moreover, collecting human preference data as feedback is costly, and disagreements between human annotators introduce variance into the training data, which can cause confusion in situations where the ground truth is ambiguous (e.g., moral dilemmas). In addition, human feedback in RLHF is often restricted to preference rankings, which carry limited information and restrict its applicability.

RLAIF vs. RLHF

In this work, RLAIF showed the potential to address these problems with RLHF. The researchers used an off-the-shelf LLM to annotate preferences between candidate pairs; the model was pre-trained or instruction-tuned for general purposes but not fine-tuned for the specific downstream task. Given a text and two candidate summaries, the LLM is asked to evaluate which summary is better. The input is structured as follows:

1. Preamble – instructions that introduce and describe the task at hand;
2. Few-shot exemplars – an example text, a pair of summaries, a chain-of-thought (CoT) rationale, and a preference judgment;
3. Sample to annotate – a text and a pair of summaries to be labeled;
4. Ending – an ending string that prompts the LLM for its answer.

After the LLM receives this input, the researchers obtain the log-probabilities of it generating the tokens “1” and “2” and compute a softmax over them to obtain a preference distribution (a code sketch of this computation appears below).

They experimented with two types of preambles. The “Base” preamble simply asks “Which summary is better?”, while the “OpenAI” preamble mimics the rating instructions given to the human preference annotators in the OpenAI TL;DR project and contains detailed guidance on what makes a strong summary. They also ran in-context learning experiments that provide additional context by adding a few manually selected exemplars covering different topics.

After the LLM labeled the preferences, the researchers trained a reward model (RM) to predict them (a sketch of such reward-model training also appears below). Three metrics were then used for evaluation: AI labeler alignment, pairwise accuracy, and win rate.
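To make the preference-labeling step concrete, here is a minimal Python sketch of how a soft preference distribution can be computed from the labeler LLM's log-probabilities for the answer tokens “1” and “2”. The example log-probabilities are made up, and the sketch is illustrative rather than the authors' code.

```python
import math

def preference_distribution(logprob_1: float, logprob_2: float):
    """Softmax over the labeler LLM's log-probabilities for the answer
    tokens "1" and "2", giving a soft preference distribution."""
    m = max(logprob_1, logprob_2)          # subtract the max for numerical stability
    e1 = math.exp(logprob_1 - m)
    e2 = math.exp(logprob_2 - m)
    return e1 / (e1 + e2), e2 / (e1 + e2)  # P(summary 1 better), P(summary 2 better)

# Example with made-up log-probabilities returned by a labeler model:
p1, p2 = preference_distribution(-0.3, -1.4)
print(f"P(summary 1 preferred) = {p1:.2f}, P(summary 2 preferred) = {p2:.2f}")
```

The resulting soft probabilities are the preference distribution referred to above and can serve as training targets for the reward model.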
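The reward-modeling step can also be made concrete with a short sketch. The PyTorch example below trains a toy reward model on AI preference labels with a Bradley-Terry style pairwise loss; the embedding-based architecture, the loss choice, and the random tensors are assumptions made for illustration, not details taken from the paper (a real RM would typically be a language model with a scalar value head).

```python
import torch
import torch.nn as nn

class TinyRewardModel(nn.Module):
    """Toy reward model: maps a fixed-size text embedding to a scalar reward.
    (Stand-in for a full language model with a scalar head.)"""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.scorer(emb).squeeze(-1)   # one scalar reward per example

rm = TinyRewardModel()
opt = torch.optim.Adam(rm.parameters(), lr=1e-4)

# Fake batch: embeddings of summary 1 and summary 2 for the same texts,
# plus the AI labeler's soft preference for summary 1 (e.g., from a softmax over token log-probs).
emb1, emb2 = torch.randn(8, 128), torch.randn(8, 128)
p_prefer_1 = torch.rand(8)

# Pairwise objective: the reward gap r1 - r2, passed through a sigmoid,
# should match the labeler's preference probability for summary 1.
r1, r2 = rm(emb1), rm(emb2)
loss = nn.functional.binary_cross_entropy_with_logits(r1 - r2, p_prefer_1)
opt.zero_grad()
loss.backward()
opt.step()
print(f"pairwise preference loss: {loss.item():.3f}")
```

In an RLAIF pipeline, the trained reward model then scores policy outputs during the reinforcement learning stage, just as in RLHF, but without any human-labeled preferences.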
Experimental results show that RLAIF can be a viable alternative to RLHF that does not rely on human annotators. In human evaluation, RLAIF was preferred over a baseline supervised fine-tuned model 71% of the time, while RLHF was preferred over the same baseline 73% of the time. The study also compared the win rates of RLAIF and RLHF against each other in terms of human preference, and the results showed that the two were equally preferred by human evaluators. In addition, the researchers compared RLAIF and RLHF summaries with human-written reference summaries: the RLAIF summaries were preferred over the reference summaries in 79% of cases, and the RLHF summaries in 80% of cases.

However, while this work highlights the potential of RLAIF, it has some limitations. First, the study focused only on the summarization task, so its generalization to other tasks is unclear. Second, the study did not fully evaluate the cost-effectiveness of LLM inference compared with manual annotation. There are also many interesting open questions, such as whether combining RLHF with RLAIF can surpass either method alone, how effective it is to use an LLM to assign rewards directly, whether improving AI labeler alignment translates into a better final policy, and whether an LLM labeler of the same size as the policy model can further improve the policy. Nevertheless, this study lays a solid foundation for deeper research on RLAIF, and we look forward to more outstanding results in this field.

Reference links:
https://arxiv.org/abs/2309.00267
https://bdtechtalks.com/2023/09/04/rlhf-limitations/

Author: Yan Yimi
Editor: Academic