Learning Code Preference via Synthetic Evolution

1 University of Illinois Urbana-Champaign       2 AWS AI Labs

Obtaining code preferences effectively and efficiently is a crucial problem to study. To this end, we present:
  • ✨ Technique: we introduce CodeFavor, an open recipe for training code preference models with synthetic code evolution, such as code commits and code critiques.
  • ✨ Benchmark: we release CodePrefBench -- 1364 rigorously curated code preference tasks, covering verifiable objectives (✅ correctness, 🚀 efficiency, 🛡️ security) and 👍 human preference.
  • ✨ Findings: (i) quantifying the cost and performance of human preference based on 18 developers; (ii) controlled experiments on data, code comments, criteria, and modeling in training code preference models; and (iii) case studies of LLM preference regarding code correctness, efficiency, and security.

CodeFavor: Code Evolution → Code Preference


CodeFavor is a framework that trains pairwise preference models with synthetic code preferences generated from code evolution like code commits and code critiques.

Preference Modeling

CodeFavor trains code preference models that predict pairwise preference from an input with the following components:
  1. Instruction: The instruction associated with the code candidates.
  2. Code pair: A pair of code snippet candidates to be compared.
  3. Criteria: The criteria for comparison, such as correctness and efficiency.
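The three components above can be pictured as one structured query rendered into a text prompt. The following is a minimal sketch only: the `PreferenceQuery` class, the `build_prompt` helper, and the template wording are hypothetical illustrations, not the actual CodeFavor prompt.

```python
from dataclasses import dataclass


@dataclass
class PreferenceQuery:
    """One pairwise comparison: an instruction, two candidates, and a criterion."""
    instruction: str
    code_a: str
    code_b: str
    criteria: str


def build_prompt(query: PreferenceQuery) -> str:
    """Render the three input components into a single prompt.

    The template below is an assumption made for illustration.
    """
    return (
        f"Instruction:\n{query.instruction}\n\n"
        f"[CODE_A]\n{query.code_a}\n\n"
        f"[CODE_B]\n{query.code_b}\n\n"
        f"Criteria: {query.criteria}\n"
        "Which code is better, A or B?"
    )
```

Keeping the criteria as an explicit field means the same code pair can be judged under different objectives (for example, correctness versus efficiency) without changing the rest of the prompt.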

Based on this format, the preference model is trained to predict code preference using two modeling methods:
  1. Classification: the preference is decided by the decoding probability of next single tokens such as "A" or "B" (following SLiC-HF).
  2. Generation: the preference is decided by the generation of analysis and conclusion over the code pair.
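For the classification method, the decision can be sketched as a comparison of the next-token probabilities of the two label tokens. This is a hedged illustration: the `classify_preference` function and the assumption that the model exposes log-probabilities for the tokens "A" and "B" as a plain dictionary are hypothetical, not the CodeFavor implementation.

```python
import math


def classify_preference(next_token_logprobs: dict[str, float]) -> tuple[str, float]:
    """Pick "A" or "B" from next-token log-probabilities.

    Returns the preferred label and its probability renormalized over {A, B}.
    """
    logp_a = next_token_logprobs["A"]
    logp_b = next_token_logprobs["B"]
    # Softmax restricted to the two label tokens, shifted for numerical stability.
    max_lp = max(logp_a, logp_b)
    exp_a = math.exp(logp_a - max_lp)
    exp_b = math.exp(logp_b - max_lp)
    prob_a = exp_a / (exp_a + exp_b)
    return ("A", prob_a) if prob_a >= 0.5 else ("B", 1.0 - prob_a)
```

Because only a single token is decoded, this style of inference is cheap; the generation method instead spends tokens on an analysis before stating its conclusion.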

💡 Design considerations



Synthetic Data Generation

  • Code commits (Commit-Instruct): CodeFavor transforms a given code commit into a synthetic code preference training sample. Specifically, it employs a critic model to rephrase the pre-commit code to a rejected code candidate and the post-commit code to the accepted code candidate. Code quality filtering is further applied to ensure the quality of the preference samples.

  • Code critiques (Critic-Evol): Given a coding instruction, CodeFavor first lets a smaller model generate a draft code candidate. A larger critic model then checks the quality of the draft and provides feedback. If the draft is considered improvable, the critic model provides a revised code candidate. Together, the draft and the revision form a contrastive code pair for training the preference model.
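The two data-generation recipes above can be sketched as small pipelines over pluggable models. Everything named here (`PreferencePair`, `Commit`, `commit_instruct`, `critic_evol`, and the callable signatures) is a hypothetical interface chosen so the sketch runs without an LLM backend; in particular, quality filtering is modeled simply as the rephraser returning `None`.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class PreferencePair:
    instruction: str
    rejected: str
    accepted: str


@dataclass
class Commit:
    message: str
    pre_code: str
    post_code: str


# Hypothetical model interfaces, expressed as plain callables.
DraftModel = Callable[[str], str]  # instruction -> draft code
# (instruction, draft) -> revised code, or None if nothing needs improving
CriticModel = Callable[[str, str], Optional[str]]
# commit -> rephrased pair, or None if the commit is filtered out
CommitRephraser = Callable[[Commit], Optional[PreferencePair]]


def commit_instruct(commit: Commit, rephrase: CommitRephraser) -> Optional[PreferencePair]:
    """Commit-Instruct: turn a commit into a (rejected, accepted) training sample."""
    pair = rephrase(commit)
    if pair is None or pair.rejected.strip() == pair.accepted.strip():
        # Filtered out, or the two sides are not actually different.
        return None
    return pair


def critic_evol(
    instruction: str, draft_model: DraftModel, critic_model: CriticModel
) -> Optional[PreferencePair]:
    """Critic-Evol: draft with a smaller model, then revise with a larger critic."""
    draft = draft_model(instruction)
    revised = critic_model(instruction, draft)
    if revised is None or revised == draft:
        # The critic found nothing to improve, so no contrastive pair exists.
        return None
    return PreferencePair(instruction, rejected=draft, accepted=revised)
```

In both cases the earlier version of the code becomes the rejected candidate and the later version the accepted one, which is what makes code evolution a natural source of preference labels.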

CodePrefBench: Evaluating Code Preferences


Overview of CodePrefBench.

💡 Sample task


Setup

Code preference cannot be evaluated from a single angle. We therefore adopt a comprehensive set of objectives (encoded by the criteria field) to evaluate the code preferences of both LLMs and human developers:
  1. ✅ Code correctness: the ground-truth label is determined by running the code candidates against the extensive EvalPlus test cases.
  2. 🚀 Code efficiency: labels are obtained by profiling the number of CPU instructions executed by the code candidates over the performance-exercising tasks and test inputs in EvalPerf.
  3. 🛡️ Code security: each task includes a pair of secure and vulnerable code snippets, whose security is examined by the static analyzers in CyberSecEval.
  4. 👍 Human preference: we engage 18 developers, with 3 annotators per code pair. The code candidates are sampled from LLMs and are paired using the maximum edit distance to ensure sufficient differentiation. Detailed annotation statistics are available in the Quantifying Human Preference section.
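The maximum-edit-distance pairing used for the human preference tasks can be illustrated as follows. This is a sketch under assumptions: it uses character-level Levenshtein distance and an exhaustive search over pairs, and the function names are hypothetical; the granularity and search procedure actually used are not stated here.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance via a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(
                prev[j] + 1,         # delete ch_a
                curr[j - 1] + 1,     # insert ch_b
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


def most_different_pair(candidates: list[str]) -> tuple[str, str]:
    """Return the two candidates with the largest pairwise edit distance."""
    if len(candidates) < 2:
        raise ValueError("need at least two candidates")
    return max(combinations(candidates, 2), key=lambda pair: edit_distance(*pair))
```

Choosing the most distant pair avoids presenting annotators with near-identical snippets, where a preference would be hard to justify.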

โš™๏ธ Examinee setup

๐Ÿ“ฆ CodeFavor data



Main Results


Accuracy (%) of evaluated models on CodePrefBench. Scores within 1pp of the highest are highlighted in bold. Bracketed numbers denote the range contributed by uncertain responses; half of their ratio is counted toward the final accuracy score.
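The half-credit scoring described in the caption can be written out as a short calculation. This is a minimal sketch: the `preference_accuracy` function and the convention that any answer other than "A" or "B" counts as uncertain are assumptions made for illustration.

```python
def preference_accuracy(predictions: list[str], labels: list[str]) -> tuple[float, float]:
    """Return (accuracy %, uncertain %), giving uncertain answers half credit.

    Any prediction other than "A" or "B" is treated as uncertain.
    """
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal in length")
    total = len(labels)
    correct = sum(p == y for p, y in zip(predictions, labels) if p in ("A", "B"))
    uncertain = sum(p not in ("A", "B") for p in predictions)
    accuracy = 100.0 * (correct + 0.5 * uncertain) / total
    return accuracy, 100.0 * uncertain / total
```

For example, with 2 correct answers, 1 wrong answer, and 1 uncertain answer out of 4 tasks, the score is 100 × (2 + 0.5) / 4 = 62.5, with an uncertain share of 25.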

Quantifying Human Preference

โš™๏ธ Human baseline setup

๐Ÿ” Human preference strongly aligns with code correctness, but can struggle with non-functional properties.

The result figure presents the accuracy of human preference via 3-developer major voting:
  • โœ… Correctness: human preference best aligns with code correctness, which is also the most challenging category for LLMs.
  • ๐Ÿš€ Efficiency: human preference can be suboptimal to the best LLMs.
  • ๐Ÿ›ก๏ธ Security: human preference is unsure for a significant portion (73.9%) of code security pairs.
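The 3-developer majority voting can be sketched as below. The tie-handling rule is an assumption: when no option wins a strict majority, this sketch returns "Tie" (uncertain); how such cases were actually resolved is not stated here, and the function name is hypothetical.

```python
from collections import Counter


def majority_vote(votes: list[str]) -> str:
    """Aggregate annotator votes ("A", "B", or "Tie") into one label.

    Assumption: without a strict majority, the result is "Tie".
    """
    if not votes:
        raise ValueError("at least one vote is required")
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else "Tie"
```

With three annotators, two agreeing votes decide the label, so a category where most annotators are unsure will mostly aggregate to uncertain results.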

Confidence

Developers are generally confident in their annotations, especially for code correctness, where confidence is higher than for efficiency and security, consistent with the accuracy results. Annotator notes suggest this is partly because correctness can be assessed by manual testing, while the other properties are less straightforward to evaluate.
Human confidence for different objectives

Time

  1. Labeling human preference is time-consuming: after removing the top-1% longest outliers, each task costs each developer 7.8 minutes on average, with a 99th percentile of 26 minutes. This is expected, as code can be much harder to understand than general natural language.
  2. The code correctness category is faster and easier to label than efficiency and security, consistent with the accuracy results and confidence levels.
CDF of labeling time (top-left is faster)
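The timing statistics above (mean and 99th percentile after trimming the longest 1% of samples) could be computed along the following lines. This is a sketch only: the function name is hypothetical, and it assumes the percentile is taken over the trimmed data with linear interpolation, which may differ from the exact procedure used.

```python
import statistics


def summarize_labeling_time(minutes: list[float], trim_fraction: float = 0.01) -> dict[str, float]:
    """Drop the longest `trim_fraction` of samples, then report mean and 99th percentile."""
    ordered = sorted(minutes)
    n_drop = int(len(ordered) * trim_fraction)
    kept = ordered[: len(ordered) - n_drop] if n_drop else ordered
    if len(kept) < 2:
        raise ValueError("need at least two samples after trimming")
    # quantiles(n=100) returns 99 cut points; index 98 is the 99th percentile.
    p99 = statistics.quantiles(kept, n=100, method="inclusive")[98]
    return {"mean": statistics.fmean(kept), "p99": p99}
```

Trimming before averaging keeps a few abandoned or interrupted sessions from dominating the mean.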

Cost

Human preference can be highly costly: it can be two orders of magnitude more expensive than serving one of the largest open-weight LLMs. Yet the overall accuracy of human preference is not as competitive as that of such large models, mainly due to the large gap in judging code security.
Estimated per-sample cost and accuracy

Implications and Open Questions

  • Human preference strongly aligns with code correctness, but can struggle with non-functional properties -- should we focus on using human preference for selecting functional code and leave non-functional properties to LLMs or external tools?
  • Labeling human preference for code is expensive and time-consuming -- how can we improve the productivity of human annotation for faster and more cost-effective preference labeling?
  • Code is much harder to understand than general natural language, and even code preferences from experienced developers can be imperfect -- how can we assure the quality of openly crowd-sourced preference votes for code?

Controlled Experiments

We compile a list of empirical conclusions based on our controlled experiments:

๐Ÿ” Model merging outperforms data mixture by ~5%.
We have two sources of datasets: Commit-Instruct-EditPack and Critic-Evol-SOSS. To make use of both datasets, we tried two strategies: (i) data mixture -- directly mix the two datasets; and (ii) model merging -- train two models separately and merge them in the inference stage. The result indicates that model merging can lightly surpass data mixture by ~5% in overall accuracy.
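To make the model-merging strategy concrete, here is a minimal sketch of one common merging operator, uniform weight averaging. This is an assumption: the text above does not specify how the two models are merged, and the `average_weights` function and the dictionary-of-lists parameter format are hypothetical stand-ins for real model state.

```python
def average_weights(
    state_a: dict[str, list[float]],
    state_b: dict[str, list[float]],
    alpha: float = 0.5,
) -> dict[str, list[float]]:
    """Linearly interpolate two parameter dictionaries with identical keys and shapes.

    alpha is the weight on state_a; alpha=0.5 gives a uniform average.
    """
    if state_a.keys() != state_b.keys():
        raise ValueError("models must share the same parameter names")
    merged: dict[str, list[float]] = {}
    for name in state_a:
        weights_a, weights_b = state_a[name], state_b[name]
        if len(weights_a) != len(weights_b):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [alpha * x + (1.0 - alpha) * y for x, y in zip(weights_a, weights_b)]
    return merged
```

Weight averaging of this kind requires both models to be fine-tuned from the same base architecture; data mixture, by contrast, trains a single model on the combined datasets.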
๐Ÿ” Classification v.s. generation: tradeoffs

Impact of training data and modeling in training CodeFavor models.

๐Ÿ” Detailing criteria leads to more accurate preference
๐Ÿ” Code comments can be distracting

Controlled experiments on input prompts.

๐Ÿ” Critic-Evol: can we use equally strong draft and critic models?

Case Studies

Examples of CodePrefBench tasks and responses from LLMs and human developers.

✅ Human baseline misses requirement details in the instruction.

All models capture the "lower-case" requirement, while all human annotators miss this detail in the description and choose Code A (likely due to its simplicity).

✅ Reasoning mistakes in Claude 3.5 Sonnet and DeepSeek V2.5

🚀 Algorithmic complexity analysis is important!

🚀 Don't underestimate the efficiency impact of built-in functions!

🛡️ os.popen or subprocess.run? Human baseline predicts the wrong security preference.

🛡️ SHA-256 or SHA-1 hashing? Defensive code security preference in Gemini.


Citation

@article{liu2024learning,
    title = {Learning Code Preference via Synthetic Evolution},
    author = {Liu, Jiawei and Nguyen, Thanh and Shang, Mingyue and Ding, Hantian and Li, Xiaopeng and Yu, Yu and Kumar, Varun and Wang, Zijian},
    journal = {arXiv preprint arXiv:2410.03837},
    year = {2024},
}