GitTaskBench


GitTaskBench is a benchmark for evaluating code agents on real-world tasks. It measures performance from repository understanding through task delivery and includes a cost-aware metric. It can help operations teams improve agent efficiency in development and bug-fixing workflows, and it works with code agents such as Claude.


Intermediate | 30 min to implement | automation

Saves ~300 min per use

Quick Install

git clone https://github.com/QuantaAlpha/GitTaskBench.git

Works with:

Claude

How to Use

1. Define the task clearly. Include the repository URL, the commit hash (if applicable), and specific requirements for the code agent. For example: "Implement feature X in repository Y (commit Z) that does A, B, and C."
   Tip: Be as specific as possible about inputs, outputs, and constraints to ensure accurate evaluation.
2. Run the code agent on the task. Use a code agent such as Claude or Cursor, and make sure it has access to the repository and the necessary tools (e.g., terminal, file editor).
   Tip: Monitor the agent's progress in real time to catch bottlenecks or misunderstandings early.
3. Collect the agent's outputs, including code changes, logs, and any intermediate steps. Save these outputs for evaluation.
   Tip: Use a structured format (e.g., JSON) to log the agent's actions and decisions for easier analysis.
4. Evaluate the agent's performance using the GitTaskBench framework. Assess the four dimensions: repository understanding, task completion quality, cost efficiency, and time efficiency.
   Tip: Use the GitTaskBench metric to calculate an overall score and identify areas for improvement.
5. Generate a detailed report with actionable recommendations. Share the report with the team to refine the agent's performance on future tasks.
   Tip: Focus on specific, measurable improvements (e.g., "Reduce token usage by 20% by reusing code snippets").
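
The task definition and the structured log mentioned in these steps could be captured as follows. This is a minimal sketch, not part of GitTaskBench itself: the `TaskSpec` and `AgentEvent` classes, their field names, and the JSON Lines layout are all assumptions made for illustration, and the repository URL is a placeholder.

```python
import json
import time
from dataclasses import dataclass, asdict, field


@dataclass
class TaskSpec:
    """Hypothetical task description; field names are illustrative."""
    repo_url: str
    commit: str
    requirements: list[str]


@dataclass
class AgentEvent:
    """One logged agent action (hypothetical schema)."""
    action: str  # e.g. "read_file", "edit_file", "run_command"
    detail: str
    timestamp: float = field(default_factory=time.time)


def to_jsonl(task: TaskSpec, events: list[AgentEvent]) -> str:
    """Serialize the task and its events as JSON Lines (one object per line)."""
    lines = [json.dumps({"type": "task", **asdict(task)})]
    lines += [json.dumps({"type": "event", **asdict(e)}) for e in events]
    return "\n".join(lines)


task = TaskSpec(
    repo_url="https://example.com/org/repo.git",  # placeholder URL
    commit="a1b2c3d",
    requirements=["does A", "does B", "does C"],
)
events = [
    AgentEvent("read_file", "main.py"),
    AgentEvent("edit_file", "main.py"),
]
log_text = to_jsonl(task, events)
print(log_text)
```

Writing one JSON object per line keeps the log appendable while the agent is still running, and each line can be parsed on its own during evaluation.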

Setup & Installation

Quick Install

No install command available. Check the GitHub repository for manual installation instructions.

Alternative Install (Git Clone)

git clone https://github.com/QuantaAlpha/GitTaskBench

Requirements

Claude Code or compatible AI agent

Quick Start Guide

Install the Skill

Copy the git clone command above and run it in your terminal.

Open Your AI Agent

Launch Claude Code, Cursor, or your preferred AI coding agent.

Try It Out

Use the prompt template or examples below to test the skill.

Customize

Adapt the skill to your specific use case and workflow.

Usage Examples

Prompt Template

Act as a GitTaskBench evaluator. Evaluate the performance of a code agent on the following task: [TASK_DESCRIPTION]. Use the GitTaskBench framework to assess the agent's performance across these dimensions: 1) Repository understanding (score 0-100), 2) Task completion quality (score 0-100), 3) Cost efficiency (score 0-100), 4) Time efficiency (score 0-100). Provide a detailed breakdown of the agent's strengths and weaknesses, and suggest improvements for future tasks. Use the GitTaskBench metric to calculate the overall score.
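
Filling the template programmatically might look like the sketch below. The `fill_prompt` helper is hypothetical, and the template string is a shortened stand-in for the full text above; only the `[TASK_DESCRIPTION]` placeholder comes from the template itself.

```python
PLACEHOLDER = "[TASK_DESCRIPTION]"


def fill_prompt(template: str, task_description: str) -> str:
    """Replace the placeholder with a concrete task description."""
    if PLACEHOLDER not in template:
        raise ValueError(f"template has no {PLACEHOLDER} placeholder")
    # The template already ends the sentence with a period, so drop a trailing one.
    description = task_description.strip().rstrip(".")
    if not description:
        raise ValueError("task description is empty")
    return template.replace(PLACEHOLDER, description)


# Shortened stand-in for the full prompt template.
template = (
    "Act as a GitTaskBench evaluator. Evaluate the performance of a code "
    "agent on the following task: [TASK_DESCRIPTION]. Use the GitTaskBench "
    "framework to assess the agent's performance."
)
prompt = fill_prompt(template, "Implement feature X in repository Y (commit Z).")
print(prompt)
```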

Example Output

### GitTaskBench Evaluation Report

**Task:** Implement a REST API endpoint in Python using FastAPI that returns a paginated list of users from a SQLite database. The endpoint should support filtering by `user_type` and sorting by `created_at`. The task was executed in the `user-management` repository (commit `a1b2c3d`).

#### Performance Metrics:

1. **Repository Understanding (Score: 92/100)**
   - The agent correctly identified the project structure, including the `models.py` (SQLAlchemy models), `schemas.py` (Pydantic models), and `main.py` (FastAPI app).
   - It accurately mapped the database schema and recognized the `User` model with fields: `id`, `name`, `email`, `user_type`, `created_at`.
   - Minor oversight: The agent initially missed the `user_type` enum definition in `constants.py`, which caused a minor delay in implementation.

2. **Task Completion Quality (Score: 88/100)**
   - The agent successfully implemented the `/users` endpoint with pagination (limit/offset), filtering by `user_type`, and sorting by `created_at`.
   - The endpoint returns a JSON response with the correct structure:
     ```json
     {
       "data": [
         {"id": 1, "name": "Alice", "email": "alice@example.com", "user_type": "admin", "created_at": "2023-01-01T00:00:00"},
         {"id": 2, "name": "Bob", "email": "bob@example.com", "user_type": "user", "created_at": "2023-01-02T00:00:00"}
       ],
       "pagination": {
         "total": 2,
         "limit": 10,
         "offset": 0
       }
     }
     ```
   - **Issues:**
     - The agent did not handle edge cases (e.g., invalid `user_type` values or empty results).
     - The pagination metadata was slightly off (total count was hardcoded as 2 instead of dynamically calculated).
     - No unit tests were generated, though the task did not explicitly require them.

3. **Cost Efficiency (Score: 85/100)**
   - The agent made 43 API calls to the LLM (Claude), consuming approximately 12,500 tokens (input: 8,200, output: 4,300).
   - The agent reused existing code snippets (e.g., database connection logic) but did not optimize for token efficiency in prompts.
   - **Improvement:** The agent could have reused the pagination logic from a similar endpoint in the codebase to reduce token usage.

4. **Time Efficiency (Score: 80/100)**
   - Total time taken: 18 minutes and 42 seconds.
   - Breakdown (approximate):
     - Repository analysis: 5 minutes
     - Implementation: 10 minutes
     - Debugging: 3 minutes (fixing minor issues with sorting)
   - **Bottleneck:** The agent spent significant time debugging the sorting logic due to a misunderstanding of SQLite's `ORDER BY` clause with timestamps.

#### GitTaskBench Metric:
- **Overall Score: 86.8/100**
- **Weighted Breakdown:**
  - Repository Understanding: 25%
  - Task Completion Quality: 35%
  - Cost Efficiency: 20%
  - Time Efficiency: 20%
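
One way to combine the four dimension scores with these weights is a plain weighted sum, sketched below. The report does not state the formula, so the weighted sum is an assumption, and the `overall_score` function and dictionary keys are hypothetical names. A reported overall score can be cross-checked against this kind of calculation.

```python
# Weights from the breakdown above, expressed as fractions of 1.
WEIGHTS = {
    "repository_understanding": 0.25,
    "task_completion_quality": 0.35,
    "cost_efficiency": 0.20,
    "time_efficiency": 0.20,
}


def overall_score(scores: dict, weights: dict = WEIGHTS) -> float:
    """Weighted sum of per-dimension scores (assumed formula, each score 0-100)."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same dimensions")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(scores[name] * weights[name] for name in weights)


# Per-dimension scores from the example report.
scores = {
    "repository_understanding": 92,
    "task_completion_quality": 88,
    "cost_efficiency": 85,
    "time_efficiency": 80,
}
print(round(overall_score(scores), 1))  # 86.8
```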

#### Recommendations for Improvement:
1. **Repository Understanding:** Add a step to explicitly list all relevant files and their purposes before diving into implementation.
2. **Task Completion Quality:** Include edge-case handling (e.g., invalid inputs) and dynamic pagination metadata.
3. **Cost Efficiency:** Reuse code snippets and optimize prompts to reduce token usage (e.g., by referencing existing patterns).
4. **Time Efficiency:** Pre-validate assumptions about database schema or API requirements to avoid debugging delays.
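
The second recommendation (edge-case handling and dynamic pagination metadata) could look roughly like the sketch below. It uses only the standard-library `sqlite3` module rather than the FastAPI and SQLAlchemy stack named in the task, so it illustrates the query logic only. The `list_users` function, the `users` table name, and the allowed `user_type` values are assumptions; the values are taken from the sample response, not from the actual `constants.py`.

```python
import sqlite3

# Assumed enum values, taken from the sample response in the report.
ALLOWED_USER_TYPES = {"admin", "user"}
COLUMNS = ("id", "name", "email", "user_type", "created_at")


def list_users(conn, user_type=None, limit=10, offset=0, descending=False):
    """Return one page of users plus pagination metadata."""
    if user_type is not None and user_type not in ALLOWED_USER_TYPES:
        raise ValueError(f"invalid user_type: {user_type!r}")
    if limit < 1 or offset < 0:
        raise ValueError("limit must be >= 1 and offset must be >= 0")

    where, params = "", []
    if user_type is not None:
        where, params = "WHERE user_type = ?", [user_type]

    # Count matching rows instead of hardcoding the total.
    total = conn.execute(f"SELECT COUNT(*) FROM users {where}", params).fetchone()[0]

    # The direction is one of two fixed literals, never raw user input.
    # ISO-8601 timestamps stored as TEXT in one format sort chronologically.
    direction = "DESC" if descending else "ASC"
    rows = conn.execute(
        f"SELECT {', '.join(COLUMNS)} FROM users {where} "
        f"ORDER BY created_at {direction} LIMIT ? OFFSET ?",
        [*params, limit, offset],
    ).fetchall()

    return {
        "data": [dict(zip(COLUMNS, row)) for row in rows],  # empty list if no match
        "pagination": {"total": total, "limit": limit, "offset": offset},
    }


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT, "
    "user_type TEXT, created_at TEXT)"
)
conn.executemany(
    "INSERT INTO users (name, email, user_type, created_at) VALUES (?, ?, ?, ?)",
    [
        ("Alice", "alice@example.com", "admin", "2023-01-01T00:00:00"),
        ("Bob", "bob@example.com", "user", "2023-01-02T00:00:00"),
    ],
)
page = list_users(conn, user_type="admin")
print(page["pagination"])
```

Running the count with the same `WHERE` clause as the page query keeps `total` consistent with the active filter.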

#### Next Steps:
- Re-run the task with the suggested improvements and aim for a score above 90.
- Integrate GitTaskBench into the CI/CD pipeline to continuously evaluate code agent performance on real-world tasks.
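
A pipeline check for the score target could be as small as the sketch below. The `gate` function and the `overall_score` report key are hypothetical; GitTaskBench's own output format is not described here, so the report is assumed to be a JSON object with that single field.

```python
import json

THRESHOLD = 90.0  # target taken from the next steps above


def gate(report: dict, threshold: float = THRESHOLD) -> int:
    """Return an exit code: 0 if the score is above the threshold, else 1."""
    score = float(report["overall_score"])  # hypothetical key name
    if score > threshold:
        print(f"PASS: {score:.1f} > {threshold:.1f}")
        return 0
    print(f"FAIL: {score:.1f} is not above {threshold:.1f}")
    return 1


# Illustrative report; in a CI job this would be loaded from a file.
report = json.loads('{"overall_score": 85.0}')
exit_code = gate(report)
print(exit_code)
```

In a CI job, passing the returned value to `sys.exit()` would fail the build whenever the score does not clear the target.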