Learning Code Preference via Synthetic Evolution

1 University of Illinois Urbana-Champaign       2 AWS AI Labs

How to obtain code preference effectively and efficiently is a crucial open question! To this end, we present:
  • ✨Technique: we introduce CodeFavor, an open recipe for training code preference models with synthetic code evolution such as code commits and code critiques.
  • ✨Benchmark: we release CodePrefBench -- 1364 rigorously curated code preference tasks, covering verifiable objectives (✅correctness, 🚀efficiency, 🛡️security) and 👍human preference.
  • ✨Findings: (i) quantifying the cost & performance of human preference based on 18 developers; (ii) controlled experiments on data, code comments, criteria, & modeling in training code preference models; and (iii) case studies of LLM preference towards code correctness, efficiency, and security.

CodeFavor: Code Evolution → Code Preference


CodeFavor is a framework that trains pairwise preference models on synthetic code preferences generated from code evolution, such as code commits and code critiques.

Preference Modeling

CodeFavor trains code preference models that predict pairwise preference from the following input format:
  1. Instruction: The instruction associated with the code candidates.
  2. Code pair: A pair of candidate code snippets to be compared.
  3. Criteria: The criteria for comparison, such as correctness and efficiency.

Based on this format, the preference model is trained to predict code preference using one of two modeling methods:
  1. Classification: the preference is decided by the next-token decoding probability of a single token such as "A" or "B" (following SLiC-HF); a minimal sketch follows this list.
  2. Generation: the preference is decided by generating an analysis and a conclusion over the code pair.
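Below is a minimal sketch of the classification-style scorer, assuming a Hugging Face causal LM; the model name and prompt template are illustrative placeholders, not the exact CodeFavor setup.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "deepseek-ai/deepseek-coder-6.7b-instruct"  # placeholder, not the CodeFavor checkpoint
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)

def classify_preference(instruction: str, code_a: str, code_b: str, criteria: str) -> str:
    # Assemble the three-part input: instruction, code pair, and criteria.
    prompt = (
        f"Instruction:\n{instruction}\n\n"
        f"Code A:\n{code_a}\n\n"
        f"Code B:\n{code_b}\n\n"
        f"Criteria: {criteria}\n"
        f"The better code is:"
    )
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(input_ids=ids).logits[0, -1]  # next-token logits
    # Classification modeling: compare the decoding probabilities of the
    # single tokens "A" and "B" (tokenizer-specific space handling elided).
    id_a = tok.convert_tokens_to_ids("A")
    id_b = tok.convert_tokens_to_ids("B")
    return "A" if logits[id_a] > logits[id_b] else "B"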

💡 Design considerations



Synthetic Data Generation

  • Code commits (Commit-Instruct): CodeFavor transforms a given code commit into a synthetic code preference training sample. Specifically, it employs a critic model to rephrase the pre-commit code into the rejected code candidate and the post-commit code into the accepted code candidate. Code quality filtering is further applied to ensure the quality of the preference samples.

  • Code critiques (Critic-Evol): Given a code instruction, CodeFavor first lets a smaller model generate a draft code candidate. A larger critic model then checks its code quality and provides feedback. If the draft code is considered improvable, the critic model provides a revised code candidate; combining the two yields a contrastive code pair for training the preference model (see the sketch below).
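As a concrete illustration, here is a minimal sketch of the Critic-Evol flow; the Chat callable, prompt wording, and the <code> tag convention are illustrative assumptions, not the exact CodeFavor prompts.

from typing import Callable, Optional, Tuple

Chat = Callable[[str], str]  # any LLM call: prompt in, completion out (assumption)

def critic_evol_sample(draft: Chat, critic: Chat, instruction: str) -> Optional[Tuple[str, str]]:
    # Step 1: a smaller model drafts a candidate solution.
    rejected = draft(f"Write code for the task below.\n\n{instruction}")
    # Step 2: a larger critic reviews the draft and, if it sees room for
    # improvement, replies with a revision wrapped in <code> tags.
    verdict = critic(
        "Review the code for the task. If it can be improved, reply with a "
        "revised version between <code> and </code> tags; otherwise reply OK.\n\n"
        f"Task:\n{instruction}\n\nCode:\n{rejected}"
    )
    if "<code>" not in verdict:
        return None  # the critic found nothing to fix; no contrastive pair
    chosen = verdict.split("<code>", 1)[1].split("</code>", 1)[0].strip()
    # The (rejected, chosen) pair becomes one preference training sample.
    return rejected, chosen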

CodePrefBench: Evaluating Code Preferences


Overview of CodePrefBench.

💡 Sample task


Setup

Code preference can never be evaluated from just one angle! Therefore, we adopt a comprehensive set of objectives (encoded by the criteria field) to evaluate the code preferences of both LLMs and human developers:
  1. ✅ Code correctness: the ground-truth label is determined by exercising code candidates over massive EvalPlus test cases.
  2. 🚀 Code efficiency: labels are obtained by profiling the number of CPU instructions executed by code candidates over the performance-exercising tasks and test inputs in EvalPerf (see the profiling sketch after this list).
  3. 🛡️ Code security: each task includes a pair of secure and vulnerable code snippets, whose security is examined by the static analyzers in CyberSecEval.
  4. 👍 Human preference: we engage 18 developers, with 3 annotators per code pair. The code candidates are sampled from LLMs and are paired by maximum edit distance to ensure sufficient differentiation. Detailed annotation statistics are available in the Quantifying Human Preference section.

โš™๏ธ Examinee setup

📦 CodeFavor data



Main Results


Accuracy (%) of evaluated models on CodePrefBench. Scores within 1pp of the highest are highlighted in bold. Bracketed numbers denote the rate of uncertain responses; half of this rate is credited to the final accuracy score.
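Concretely, the scoring rule grants half credit to each uncertain response; a minimal sketch (function and argument names are ours):

def accuracy(n_correct: int, n_uncertain: int, n_total: int) -> float:
    # A correct preference earns full credit; an uncertain response earns half.
    return 100.0 * (n_correct + 0.5 * n_uncertain) / n_total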

Quantifying Human Preference

โš™๏ธ Human baseline setup

๐Ÿ” Human preference strongly aligns with code correctness, but can struggle with non-functional properties.

The result figure presents the accuracy of human preference via 3-developer majority voting:
  • ✅ Correctness: human preference aligns best with code correctness, which is also the most challenging category for LLMs.
  • 🚀 Efficiency: human preference can fall behind the best LLMs.
  • 🛡️ Security: human annotators are unsure about a significant portion (73.9%) of code security pairs.

Confidence

Developers are overall confident in their annotations, especially for code correctness, where confidence is higher than for efficiency and security, mirroring the accuracy results. Annotator notes suggest this is partly because correctness can be assessed by manual testing, while the other properties are less straightforward to evaluate.
Human confidence for different objectives

Time

  1. Labeling human preference is time-consuming: after removing the top-1%-longest outliers, each task costs each developer 7.8 minutes on average, with a 99th percentile of 26 minutes. This is expected, as code can be much harder to understand than general natural language.
  2. The code correctness category is faster and easier to label than efficiency and security, consistent with the accuracy results and confidence levels.
CDF of labeling time (top-left is faster)

Cost

Human preference can be highly costly: it can be two orders of magnitude more expensive than serving one of the largest open-weight LLMs. Yet the overall accuracy of human preference is not as competitive as that of such large models (mainly due to the huge gap in judging code security).
Estimated per-sample cost and accuracy

Implications and Open Questions

  • Human preference strongly aligns with code correctness but can struggle with non-functional properties -- shall we focus on using human preference to select functional code and leave non-functional properties to LLMs or external tools?
  • Labeling human preference for code is expensive and time-consuming -- how can we improve annotator productivity for more cost-effective and faster preference annotation?
  • Code is much harder to understand than general natural language, so even code preferences from experienced developers can be imperfect -- how can we assure the quality of openly crowd-sourced preference votes for code?

Controlled Experiments

We compile a list of empirical conclusions based on our controlled experiments:

๐Ÿ” Model merging outperforms data mixture by ~5%.
We have two sources of datasets: Commit-Instruct-EditPack and Critic-Evol-SOSS. To make use of both datasets, we tried two strategies: (i) data mixture -- directly mix the two datasets; and (ii) model merging -- train two models separately and merge them in the inference stage. The result indicates that model merging can lightly surpass data mixture by ~5% in overall accuracy.
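One plausible reading of "merging at inference time" for classification-style models is to average the two models' preference probabilities; this sketch is our hedged interpretation, not necessarily the paper's exact merging recipe (model loading is elided behind a scorer callable).

from typing import Callable

# Each scorer maps a prompt to P(prefer "A"), e.g., via a softmax over the
# "A"/"B" next-token logits of the underlying preference model.
Scorer = Callable[[str], float]

def merged_preference(commit_model: Scorer, critic_model: Scorer, prompt: str) -> str:
    # Average the two specialized models' probabilities of preferring Code A.
    p_a = 0.5 * (commit_model(prompt) + critic_model(prompt))
    return "A" if p_a > 0.5 else "B"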
๐Ÿ” Classification v.s. generation: tradeoffs

Impact of training data and modeling in training CodeFavor models.

๐Ÿ” Detailing criteria leads to more accurate preference
๐Ÿ” Code comments can be distracting

Controlled experiments on input prompts.

๐Ÿ” Critic-Evol: can we use equally strong draft and critic models?

Case Studies

Examples of CodePrefBench tasks and responses from LLMs and human developers.

✅ Human baseline misses requirement details in the instruction.


All models capture the "lower-case" requirement, while all human annotators miss this detail in the description and choose Code A (likely due to its simplicity).

✅ Reasoning mistakes in Claude 3.5 Sonnet and DeepSeek V2.5

🚀 Algorithmic complexity analysis is important!

🚀 Don't underestimate the efficiency impact of built-in functions!

๐Ÿ›ก๏ธ os.popen or subprocess.run? Human baseline predicts false security preference.

๐Ÿ›ก๏ธ SHA-256 or SHA-1 encryption? Defensive code security preference in Gemini.


Citation

@article{liu2024learning,
    title = {Learning Code Preference via Synthetic Evolution},
    author = {Liu, Jiawei and Nguyen, Thanh and Shang, Mingyue and Ding, Hantian and Li, Xiaopeng and Yu, Yu and Kumar, Varun and Wang, Zijian},
    journal = {arXiv preprint arXiv:2410.03837},
    year = {2024},
}