REPOCOD: Can Language Models Replace Programmers?

The idea that AI can surpass human capabilities is no longer surprising; it is now seen as a question of when, rather than if. From autonomous driving to programming, AI is quickly catching up to, or even surpassing, human abilities. The latest LLMs solve Python coding problems with over 90% accuracy, surpassing human performance on certain tasks.

But can AI-generated code be dropped directly into real-world projects? To answer this question, a new benchmark, REPOCOD, has been introduced. REPOCOD moves away from artificial, simplified code-generation tests and instead assesses how well models perform within real project environments.

Limitations of Code Generation Benchmarks

The research team at Purdue University has raised concerns about how effective current LLMs really are at real-world code writing. Existing benchmarks such as HumanEval and MBPP focus on generating single lines or short functions, which fails to capture the repository-level and file-level context essential to actual software development. The team outlines three reasons why a new dataset is necessary. Let’s explore them.
 
  1. Reflecting Real Code Completion Tasks: Benchmarks like HumanEval and MBPP mainly include artificially constructed code completion tasks, which don’t fully represent the real coding challenges faced by software developers.
  2. Including Realistic Task Complexity: Unlike isolated algorithm tests, real development often requires complex context across multiple functions, files, and classes. Existing benchmarks focus primarily on standalone code snippets, missing the project-level context needed to evaluate comprehensive coding skills.
  3. Reliable Accuracy Evaluation Standards: Many current benchmarks rely on similarity-based metrics to assess model performance. However, code that merely looks similar to a reference solution is not necessarily functionally correct, so similarity alone is an unreliable measure of quality.
Two examples from RepoBench showing misleading metric results

The image above compares the ground-truth code (GT) with LLM-generated code, scored with CodeBLEU, a metric that rates generated code mainly by its surface- and syntax-level similarity to a reference solution. The researchers use this example to illustrate flaws in current benchmarks.
 
In case (a), the CodeBLEU score is 58.06, a relatively low score. However, the LLM-generated code is functionally equivalent to the expected code; the similarity-based metric penalizes it only because it is expressed differently. Conversely, in case (b), the generated code receives a relatively high score but does not actually behave as expected, because it handles the anticipated input values incorrectly. Here, the high score reflects nothing more than syntactic similarity. This example highlights the need to consider overall context and practical usability, not just syntax, when evaluating code quality.
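To see why, consider a minimal sketch (not taken from the paper) in which a crude token-overlap score stands in for a similarity metric: two functionally equivalent implementations share few surface tokens, yet running the same tests on both confirms that they behave identically.

```python
# Illustrative sketch only: a toy token-overlap score stands in for CodeBLEU.
import inspect

def ground_truth(values):
    """Reference solution: sum of squares of the even numbers."""
    return sum(v * v for v in values if v % 2 == 0)

def generated(values):
    """A plausible LLM rewrite: same behavior, different surface form."""
    total = 0
    for v in values:
        if v % 2 == 0:
            total += v ** 2
    return total

def token_overlap(a: str, b: str) -> float:
    """Shared-token ratio between two source strings (a toy similarity metric)."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb)

src_a, src_b = inspect.getsource(ground_truth), inspect.getsource(generated)
print(f"textual similarity: {token_overlap(src_a, src_b):.2f}")  # low score

# Functional check: both implementations agree on every test input.
tests = [[1, 2, 3, 4], [], [0, 7, 10]]
print(all(ground_truth(t) == generated(t) for t in tests))       # True
```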

How is REPOCOD Structured?

Data collection pipeline and instance structure of REPOCOD

So, how is REPOCOD structured? REPOCOD gathers data through three key steps to evaluate code generation in real-world development environments. First, during the repository selection phase, only open-source repositories with over 2,000 stars that primarily use Python are chosen to ensure data quality.
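As an illustration only (an assumed sketch, not the authors' actual collection script), that star-and-language filter can be expressed against the public GitHub search API:

```python
# Sketch of the repository filter described above (assumed, not the paper's code).
import requests

def find_candidate_repos(min_stars: int = 2000, per_page: int = 10):
    """Return (full_name, stars) for popular Python repositories on GitHub."""
    resp = requests.get(
        "https://api.github.com/search/repositories",
        params={
            "q": f"language:python stars:>{min_stars}",
            "sort": "stars",
            "order": "desc",
            "per_page": per_page,
        },
        headers={"Accept": "application/vnd.github+json"},
        timeout=30,
    )
    resp.raise_for_status()
    return [(item["full_name"], item["stargazers_count"])
            for item in resp.json()["items"]]

for name, stars in find_candidate_repos():
    print(f"{stars:>8}  {name}")
```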
 
Next, in the target function selection phase, REPOCOD employs both static and dynamic analysis techniques to identify test functions and their associated target functions within each repository. Static analysis examines the code’s syntax and structure, while dynamic analysis actively tracks other functions invoked during test execution.
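A heavily simplified sketch of those two passes (a rough illustration; the paper's tooling is more involved) might look like this: the static pass parses source code with Python's ast module to see which functions a test_* function references, while the dynamic pass records which functions are actually entered while the test runs.

```python
# Heavily simplified sketch (not the paper's tooling) of the two analysis passes.
import ast
import inspect
import sys

def double(x):
    return 2 * x

def test_double():
    assert double(3) == 6

def static_calls_in_tests(source: str):
    """Static pass: for each test_* function, collect the names it calls."""
    mapping = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef) and node.name.startswith("test_"):
            mapping[node.name] = {
                n.func.id
                for n in ast.walk(node)
                if isinstance(n, ast.Call) and isinstance(n.func, ast.Name)
            }
    return mapping

def dynamic_calls(test_callable):
    """Dynamic pass: record every function entered while the test executes."""
    called = set()

    def tracer(frame, event, arg):
        if event == "call":
            called.add(frame.f_code.co_name)
        return tracer

    sys.settrace(tracer)
    try:
        test_callable()
    finally:
        sys.settrace(None)
    return called

print(static_calls_in_tests(inspect.getsource(test_double)))  # {'test_double': {'double'}}
print(dynamic_calls(test_double))                             # includes 'double'
```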
 
Finally, in the related test case collection phase, only test cases directly linked to the target functions are selected and executed. This approach validates code accuracy efficiently without needing to run all test cases, enhancing both the precision and efficiency of the evaluation.
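For a pytest-based project, this can be as simple as invoking the test runner with only the selected test IDs. A rough sketch, using hypothetical paths and test names:

```python
# Sketch with hypothetical paths/test names: run only the tests tied to a target function.
import subprocess

def run_related_tests(repo_dir: str, test_ids: list[str]) -> bool:
    """Return True if every selected test passes inside `repo_dir`."""
    result = subprocess.run(["pytest", "-q", *test_ids], cwd=repo_dir)
    return result.returncode == 0

# Hypothetical test IDs for tests that exercise a single target function:
related = [
    "tests/test_merge.py::test_merge_on_key",
    "tests/test_merge.py::test_merge_empty",
]
print(run_related_tests("/path/to/cloned/repo", related))
```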

Assessing AI's Programming Skills

Now, with the newly constructed dataset, it’s time to assess the models’ actual coding abilities. So, how did they perform? The researchers evaluated a range of state-of-the-art (SOTA) models, across multiple model sizes, on REPOCOD.

Pass@1(%) of SOTA LLMs on REPOCOD

The evaluation metric used was pass@1, a common indicator of a code generation model’s performance. It measures whether the model produces correct code on the first attempt: models can generate multiple candidate solutions over several attempts, but pass@1 specifically captures the probability that the first generated solution is correct, i.e., that it passes the associated test cases.
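For reference, the unbiased pass@k estimator popularized by the Codex paper is a common way to compute this metric; with k = 1 it reduces to the fraction of problems whose first sampled solution passes (the REPOCOD authors' exact evaluation script may differ).

```python
# pass@k estimator; with k = 1 it is simply (passing samples) / (total samples).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n = samples generated per problem, c = samples that pass, k = attempt budget."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=10, c=3, k=1))   # 0.3: 3 of 10 samples passed on the first try
print(pass_at_k(n=10, c=3, k=5))   # higher, since more attempts are allowed
```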
 
Even a high-performance model like GPT-4 achieved only a 27.35% pass@1 success rate on the REPOCOD dataset. REPOCOD sets a demanding standard, and this result contrasts starkly with simpler coding tasks, where the same models tend to score far higher.
 
A general trend held that larger models performed slightly better. However, as code length and complexity increased, the benefit of additional model size diminished significantly, suggesting that merely scaling up model size is not enough.
