GitTaskBench is a benchmark for evaluating code agents on real-world tasks. It measures performance from repository understanding through task delivery and includes a cost-aware metric. It benefits operations teams by improving agent efficiency in development and bug-fixing workflows, and it works with code agents such as Claude Code.
git clone https://github.com/QuantaAlpha/GitTaskBench.git
1. Define the task clearly. Include the repository URL, commit hash (if applicable), and specific requirements for the code agent. For example: "Implement a feature X in repository Y (commit Z) that does A, B, and C."
   - Tip: Be as specific as possible about inputs, outputs, and constraints to ensure accurate evaluation.
2. Run the code agent on the task. Use a code agent such as Claude Code or Cursor. Ensure the agent has access to the repository and the necessary tools (e.g., terminal, file editor).
   - Tip: Monitor the agent's progress in real time to identify bottlenecks or misunderstandings early.
3. Collect the agent's outputs, including code changes, logs, and any intermediate steps. Save these outputs for evaluation.
   - Tip: Use a structured format (e.g., JSON) to log the agent's actions and decisions for easier analysis.
4. Evaluate the agent's performance using the GitTaskBench framework. Assess the four dimensions: repository understanding, task completion quality, cost efficiency, and time efficiency.
   - Tip: Use the GitTaskBench metric to calculate an overall score and identify areas for improvement.
5. Generate a detailed report with actionable recommendations. Share the report with the team to refine the agent's performance on future tasks.
   - Tip: Focus on specific, measurable improvements (e.g., "Reduce token usage by 20% by reusing code snippets").
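As a rough illustration of the task-definition and output-collection steps above, a run could be logged as JSON along the following lines. The class names (`TaskSpec`, `LogEntry`, `RunLog`) and all field names are assumptions made for this sketch; they are not a format defined by GitTaskBench.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class TaskSpec:
    """Hypothetical task definition: what the agent is asked to do."""
    repo_url: str
    commit: str
    requirements: list[str]


@dataclass
class LogEntry:
    """Hypothetical record of a single agent action."""
    step: int
    action: str  # e.g. "read_file", "edit_file", "run_command"
    detail: str


@dataclass
class RunLog:
    """Task definition plus the ordered list of agent actions."""
    task: TaskSpec
    entries: list[LogEntry] = field(default_factory=list)

    def record(self, action: str, detail: str) -> None:
        # Steps are numbered in the order they are recorded.
        self.entries.append(LogEntry(len(self.entries) + 1, action, detail))

    def to_json(self) -> str:
        # asdict() recurses into nested dataclasses and lists.
        return json.dumps(asdict(self), indent=2)


log = RunLog(TaskSpec(
    repo_url="https://example.com/org/repo.git",  # placeholder URL
    commit="<commit-hash>",                       # placeholder, fill in per task
    requirements=["does A", "does B", "does C"],
))
log.record("read_file", "README.md")
log.record("run_command", "pytest -q")
print(log.to_json())
```

Keeping the task definition and the action log in one document makes it easier to compare runs later, since each log carries the exact requirements it was evaluated against.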
Copy the clone command above and run it in your terminal. No separate install command is available; check the GitHub repository for manual installation instructions.
Launch Claude Code, Cursor, or your preferred AI coding agent.
Use the prompt template or examples below to test the skill.
Adapt the skill to your specific use case and workflow.
Act as a GitTaskBench evaluator. Evaluate the performance of a code agent on the following task: [TASK_DESCRIPTION]. Use the GitTaskBench framework to assess the agent's performance across these dimensions: 1) Repository understanding (score 0-100), 2) Task completion quality (score 0-100), 3) Cost efficiency (score 0-100), 4) Time efficiency (score 0-100). Provide a detailed breakdown of the agent's strengths and weaknesses, and suggest improvements for future tasks. Use the GitTaskBench metric to calculate the overall score.
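If the prompt is filled in programmatically, a small helper along these lines can substitute the `[TASK_DESCRIPTION]` placeholder and catch an empty description early. The `fill_prompt` helper and the shortened `TEMPLATE` string are illustrative assumptions, not part of GitTaskBench.

```python
PLACEHOLDER = "[TASK_DESCRIPTION]"

# Shortened copy of the prompt template; paste the full prompt in practice.
TEMPLATE = (
    "Act as a GitTaskBench evaluator. Evaluate the performance of a code agent "
    "on the following task: [TASK_DESCRIPTION]."
)


def fill_prompt(template: str, task_description: str) -> str:
    """Return the template with the placeholder replaced by the task text."""
    task_description = task_description.strip()
    if not task_description:
        raise ValueError("task description must not be empty")
    if PLACEHOLDER not in template:
        raise ValueError(f"template is missing {PLACEHOLDER}")
    return template.replace(PLACEHOLDER, task_description)


prompt = fill_prompt(
    TEMPLATE,
    "Implement a feature X in repository Y (commit Z) that does A, B, and C",
)
print(prompt)
```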
### GitTaskBench Evaluation Report
**Task:** Implement a REST API endpoint in Python using FastAPI that returns a paginated list of users from a SQLite database. The endpoint should support filtering by `user_type` and sorting by `created_at`. The task was executed in the `user-management` repository (commit `a1b2c3d`).
#### Performance Metrics:
1. **Repository Understanding (Score: 92/100)**
   - The agent correctly identified the project structure, including the `models.py` (SQLAlchemy models), `schemas.py` (Pydantic models), and `main.py` (FastAPI app).
   - It accurately mapped the database schema and recognized the `User` model with fields: `id`, `name`, `email`, `user_type`, `created_at`.
   - Minor oversight: The agent initially missed the `user_type` enum definition in `constants.py`, which caused a minor delay in implementation.
2. **Task Completion Quality (Score: 88/100)**
   - The agent successfully implemented the `/users` endpoint with pagination (limit/offset), filtering by `user_type`, and sorting by `created_at`.
   - The endpoint returns a JSON response with the correct structure:
   ```json
   {
     "data": [
       {"id": 1, "name": "Alice", "email": "alice@example.com", "user_type": "admin", "created_at": "2023-01-01T00:00:00"},
       {"id": 2, "name": "Bob", "email": "bob@example.com", "user_type": "user", "created_at": "2023-01-02T00:00:00"}
     ],
     "pagination": {
       "total": 2,
       "limit": 10,
       "offset": 0
     }
   }
   ```
   - **Issues:**
     - The agent did not handle edge cases (e.g., invalid `user_type` values or empty results).
     - The pagination metadata was slightly off (total count was hardcoded as 2 instead of dynamically calculated).
     - No unit tests were generated, though the task did not explicitly require them.
3. **Cost Efficiency (Score: 85/100)**
   - The agent made 43 API calls to the LLM (Claude), consuming approximately 12,500 tokens (input: 8,200, output: 4,300).
   - The agent reused existing code snippets (e.g., database connection logic) but did not optimize for token efficiency in prompts.
   - **Improvement:** The agent could have reused the pagination logic from a similar endpoint in the codebase to reduce token usage.
4. **Time Efficiency (Score: 80/100)**
   - Total time taken: 18 minutes and 42 seconds.
   - Breakdown:
     - Repository analysis: 5 minutes
     - Implementation: 10 minutes
     - Debugging: 3 minutes (fixing minor issues with sorting)
   - **Bottleneck:** The agent spent significant time debugging the sorting logic due to a misunderstanding of SQLite's `ORDER BY` clause with timestamps.
#### GitTaskBench Metric:
- **Overall Score: 86.8/100** (0.25 × 92 + 0.35 × 88 + 0.20 × 85 + 0.20 × 80)
- **Weighted Breakdown:**
  - Repository Understanding: 25%
  - Task Completion Quality: 35%
  - Cost Efficiency: 20%
  - Time Efficiency: 20%
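The weighted breakdown can be checked with a few lines of Python. This is only a sketch of a plain weighted sum over the four dimension scores listed in this report; the weighting scheme is taken from the breakdown here and should be treated as an assumption, not as an official GitTaskBench formula. The `weighted_score` helper and the dictionary keys are hypothetical names.

```python
# Dimension scores and weights as listed in this report.
scores = {
    "repository_understanding": 92,
    "task_completion_quality": 88,
    "cost_efficiency": 85,
    "time_efficiency": 80,
}
weights = {
    "repository_understanding": 0.25,
    "task_completion_quality": 0.35,
    "cost_efficiency": 0.20,
    "time_efficiency": 0.20,
}


def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of the dimension scores; weights must sum to 1."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(scores[name] * weight for name, weight in weights.items())


overall = weighted_score(scores, weights)
print(f"Overall score: {overall:.1f}/100")  # Overall score: 86.8/100
```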
#### Recommendations for Improvement:
1. **Repository Understanding:** Add a step to explicitly list all relevant files and their purposes before diving into implementation.
2. **Task Completion Quality:** Include edge-case handling (e.g., invalid inputs) and dynamic pagination metadata.
3. **Cost Efficiency:** Reuse code snippets and optimize prompts to reduce token usage (e.g., by referencing existing patterns).
4. **Time Efficiency:** Pre-validate assumptions about database schema or API requirements to avoid debugging delays.
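To make the second and fourth recommendations concrete, here is a minimal sketch of the query logic with input validation, a dynamically computed total, and sorting on `created_at`. It uses only the standard-library `sqlite3` module and deliberately leaves out FastAPI and SQLAlchemy, so it is not the agent's implementation. The `list_users` function, the `ALLOWED_USER_TYPES` set, and the `users` table layout are assumptions based on the fields named in this report.

```python
import sqlite3

# Assumed set of valid values; the report only says an enum exists in constants.py.
ALLOWED_USER_TYPES = {"admin", "user"}


def list_users(conn, user_type=None, limit=10, offset=0, descending=False):
    """Return one page of users plus pagination metadata."""
    # Reject invalid inputs up front instead of returning a misleading page.
    if user_type is not None and user_type not in ALLOWED_USER_TYPES:
        raise ValueError(f"invalid user_type: {user_type!r}")
    if limit < 1 or offset < 0:
        raise ValueError("limit must be >= 1 and offset must be >= 0")

    where, params = "", []
    if user_type is not None:
        where, params = "WHERE user_type = ?", [user_type]

    # Count the matching rows instead of hardcoding the total.
    total = conn.execute(f"SELECT COUNT(*) FROM users {where}", params).fetchone()[0]

    # ISO-8601 strings sort chronologically as text, so ORDER BY works directly.
    order = "DESC" if descending else "ASC"
    rows = conn.execute(
        f"SELECT id, name, email, user_type, created_at FROM users {where} "
        f"ORDER BY created_at {order} LIMIT ? OFFSET ?",
        [*params, limit, offset],
    ).fetchall()

    keys = ("id", "name", "email", "user_type", "created_at")
    return {
        "data": [dict(zip(keys, row)) for row in rows],  # empty list if no rows match
        "pagination": {"total": total, "limit": limit, "offset": offset},
    }


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT, "
    "user_type TEXT, created_at TEXT)"
)
conn.executemany("INSERT INTO users VALUES (?, ?, ?, ?, ?)", [
    (1, "Alice", "alice@example.com", "admin", "2023-01-01T00:00:00"),
    (2, "Bob", "bob@example.com", "user", "2023-01-02T00:00:00"),
])
print(list_users(conn, user_type="admin"))
```

Only fixed SQL fragments (`where`, `order`) are interpolated into the query string; every user-supplied value goes through `?` placeholders.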
#### Next Steps:
- Re-run the task with the suggested improvements and aim for a score above 90.
- Integrate GitTaskBench into the CI/CD pipeline to continuously evaluate code agent performance on real-world tasks.