CS 194/294-280 Advanced LLM Agents Notes - Lecture 4: Reasoning and Planning in Large Language Models
Series Paper Reading
Tulu 3: Pushing Frontiers in Open Language Model Post-Training
Tulu 3 is a family of fully open, state-of-the-art post-trained models, released alongside its data, code, and training recipes, serving as a comprehensive guide to modern post-training techniques and aiming to close the gap between open and closed post-training.
Stage 1: Data Curation
A variety of prompts, including newly created synthetic prompts, is curated and allocated across multiple stages of optimization.
Data Quality, Provenance and Scale: Targeted prompts are influential for improving core skills, while real-world queries are important for improving general chat capabilities.
Stage 2: SFT
SFT on selected prompts and completions, with the data mixture and training hyperparameters optimized through experimentation to enhance target skills.
Creating a Multi-Skill SFT Dataset: The distribution of prompts across the “general” and “skill-specific” categories was refined through several rounds of supervised finetuning on various data mixtures.
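To make the SFT objective concrete, here is a minimal sketch (assuming a PyTorch/Hugging Face setup) of next-token-prediction finetuning where the loss is computed only on completion tokens; the model name and data handling are placeholders, not the actual Tulu 3 configuration.

```python
# Minimal SFT step (PyTorch / Hugging Face style); a sketch, not the exact
# Tulu 3 training setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "your-base-model"  # placeholder checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def sft_loss(prompt: str, completion: str) -> torch.Tensor:
    # Cross-entropy on the completion tokens only; prompt tokens are masked
    # out with -100 (tokenization boundary effects ignored for simplicity).
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, :prompt_len] = -100
    return model(input_ids=full_ids, labels=labels).loss
```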
Stage 3: PT
Preference tuning, specifically DPO, applied to newly curated on-policy, synthetically created preference data from selected prompts, along with off-policy data.
Curating an On-Policy Preference Dataset
Preference Tuning Algorithm Design
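As a rough illustration of the preference tuning algorithm, the sketch below implements the standard DPO loss on (chosen, rejected) log-probabilities from the policy and a frozen reference model; the variable names, batch shape, and beta value are assumptions for the example, not Tulu 3's exact settings.

```python
# Standard DPO loss: increase the policy's log-ratio on the chosen completion
# relative to the rejected one, measured against a frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    chosen_ratio = policy_chosen_logps - ref_chosen_logps        # log pi/pi_ref (chosen)
    rejected_ratio = policy_rejected_logps - ref_rejected_logps  # log pi/pi_ref (rejected)
    # -log sigmoid(beta * margin), averaged over the batch of preference pairs.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy example with made-up sequence log-probabilities for one preference pair.
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
```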
Stage 4: RLVR
An RL-based post-training stage which trains the model with verifiable rewards instead of a reward model, as is common in traditional RLHF training. Tasks with verifiable outcomes, such as mathematical problem-solving, are selected, and rewards are provided only when the model's generations are verified to be correct.
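A minimal sketch of such a reward for a math-style task follows; the "Answer: <value>" extraction convention is an assumption made for this example, not the paper's actual verifier.

```python
# Binary verifiable reward for math-style tasks, used in place of a learned
# reward model; the answer format below is an assumption for this sketch.
import re

def verifiable_reward(generation: str, gold_answer: str) -> float:
    # Assume generations end with "Answer: <value>" (illustrative convention).
    match = re.search(r"Answer:\s*(\S+)", generation)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == gold_answer.strip() else 0.0

# Reward is given only when the final answer verifies as correct.
print(verifiable_reward("Compute step by step... Answer: 42", "42"))  # 1.0
print(verifiable_reward("The result is probably 41", "42"))           # 0.0
```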
Skill-specific RL with Verifiable Rewards
Training Infra for RL
Key Words:
Post-training: The process of refining AI models after the pretraining phase; it includes fine-tuning the model on smaller, task-specific datasets to improve its performance on particular tasks.
SFT (Supervised Finetuning): Fine-tuning the model on curated prompt–completion pairs with the standard next-token prediction loss.
DPO (Direct Preference Optimization): DPO simplifies the complex and often unstable RLHF pipeline by removing the separately trained reward model and optimizing the policy directly on preference data, which makes the fine-tuning process more stable and does away with the need for repeated sampling and constant adjustments during training.
RLVR (Reinforcement Learning with Verifiable Rewards): RLVR is a reinforcement learning framework tailored to training large language models to excel at tasks with deterministic correctness criteria. It builds on the foundational supervised finetuning of LLMs and introduces a reinforcement learning mechanism that leverages binary or scaled rewards derived from explicit task validations.
Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback
Although many models have adopted a final training stage of learning from preference feedback, using algorithms such as PPO and DPO, it is still unclear which aspects of preference learning matter most for downstream model performance.
Four core aspects of preference-based learning, namely preference data, learning algorithm, reward model, and policy training prompts, are studied to evaluate the impact of these components on downstream model performance.
DPO: Direct Preference Optimization, as defined above; it optimizes the policy directly on preference pairs without training a separate reward model.
PPO: Proximal Policy Optimization, a reinforcement learning algorithm widely used for training complex policies, including those of LLMs. It improves the model's policy iteratively while maintaining stability by limiting how far each update can move the policy.
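For reference, a minimal sketch of PPO's clipped surrogate objective, which is what keeps each policy update small and stable; the value function, entropy bonus, and the KL penalty to the reference model used in RLHF pipelines are omitted, and all names and hyperparameters are illustrative.

```python
# PPO's clipped surrogate objective; omits the value loss, entropy bonus, and
# the KL penalty to the reference model that RLHF pipelines typically add.
import torch

def ppo_clip_loss(new_logps, old_logps, advantages, clip_eps: float = 0.2):
    ratio = torch.exp(new_logps - old_logps)                        # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()                    # maximize the surrogate

# Toy example with made-up per-token log-probs and advantage estimates.
loss = ppo_clip_loss(torch.tensor([-1.0, -0.5]), torch.tensor([-1.2, -0.4]),
                     torch.tensor([0.8, -0.3]))
```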
OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs
Synthesizing knowledge from scientific literature is essential for uncovering new research directions, refining methodologies, and supporting evidence-based decisions. An effective synthesis requires precise retrieval, accurate attribution, and real-time access to current literature.
OpenScholar is a specialized retrieval-augmented LM that answers scientific queries by identifying relevant passages from 45M open-access papers and synthesizing citation-backed responses.
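A high-level sketch of this retrieve-then-synthesize loop, with the retriever, corpus, and generator left as placeholder callables rather than OpenScholar's actual components.

```python
# Retrieve-then-synthesize sketch: fetch top-k passages, then prompt an LM to
# write a cited answer. The retriever/generator are placeholder callables,
# not OpenScholar's actual retriever, datastore, or model.
from typing import Callable, List, Tuple

def answer_query(query: str,
                 retrieve: Callable[[str, int], List[Tuple[str, str]]],  # -> [(paper_id, passage)]
                 generate: Callable[[str], str],
                 k: int = 8) -> str:
    passages = retrieve(query, k)
    context = "\n\n".join(f"[{pid}] {text}" for pid, text in passages)
    prompt = ("Answer the scientific question using only the passages below, "
              "citing them by [id].\n\n"
              f"{context}\n\nQuestion: {query}\nAnswer:")
    return generate(prompt)
```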
ScholarQABench is also released as the first large-scale, multi-domain benchmark for literature search, comprising 2,967 expert-written queries and 208 long-form answers.
Key Problems are:
Hallucinations
Reliance on outdated pre-training data
Lack of transparent attribution