claudecode_gemini_and_codex_swebench

🥈Silver

Toolkit for measuring Claude Code and Codex performance over time against a baseline, using the SWEbench-lite dataset. No API key is required for Max or Pro subscribers. Useful for operations teams that need to evaluate and compare code-focused language models on real-world software engineering tasks.

2660Updated 3mo ago

Intermediate, 30 min to implement, automation

Saves ~30 min per use

Quick Install

git clone https://github.com/jimmc414/claudecode_gemini_and_codex_swebench.git

Works with:

Claude

Overview

About This Skill

The claudecode_gemini_and_codex_swebench skill is a toolkit for measuring the performance of Claude Code and Codex over time. It uses the SWEbench-lite dataset to establish a performance baseline and to track changes against it. No API key is required for Max or Pro subscribers.

Measuring against a baseline lets developers and AI practitioners identify trends in model performance, adjust their workflows, and iterate faster. The exact time savings are currently unknown, but being able to assess model performance quickly can shorten iteration and deployment cycles.

The skill is aimed at developers, product managers, and AI practitioners who want to base decisions on performance data. For example, a product manager might use it to compare different AI models for a product feature before choosing one.

Implementation difficulty is intermediate: expect to spend about 30 minutes setting up the skill. It gives AI-first workflows a structured, repeatable way to measure performance.

How to Use

1. **Prepare Your Environment:**
   - Ensure you have access to the claudecode_gemini_and_codex_swebench toolkit (included with Max/Pro subscriptions).
   - Clone the SWEbench-lite dataset locally or use the provided API endpoint: `https://swebench-lite.ai/api/v1`.
   - Install required dependencies: `pip install swebench-lite pandas matplotlib`.

2. **Run the Evaluation:**
   - Open your terminal and navigate to the toolkit directory.
   - Execute the evaluation script with your model and baseline:

     ```bash
     python evaluate_models.py \
       --model-name "claude-code-latest" \
       --baseline-model "codex-v1.2" \
       --task-category "bug-fixes" \
       --output-dir "./evaluation_results"
     ```

   - For automated comparisons, use the `--compare` flag to generate side-by-side reports:

     ```bash
     python evaluate_models.py --compare "claude-code-latest" "codex-v1.2" "feature-additions"
     ```

3. **Analyze Results:**
   - Review the generated JSON/CSV reports in `./evaluation_results`.
   - Use the provided visualization script to generate comparative plots:

     ```bash
     python visualize_results.py --input-dir "./evaluation_results" --output-dir "./plots"
     ```

   - Focus on `pass@k` metrics and failure pattern summaries to identify trends.

4. **Integrate into Workflow:**
   - Schedule weekly evaluations using a cron job or CI/CD pipeline (e.g., GitHub Actions).
   - Set up Slack/email alerts for significant deviations (e.g., >5% drop in pass@1).
   - Use the failure pattern analysis to prioritize model improvements or dataset curation.

5. **Optimize for Your Use Case:**
   - Customize the `--task-category` parameter to match your team’s focus (e.g., `regression-tests`, `documentation-updates`).
   - Adjust the `--k-values` parameter to evaluate pass@1, pass@5, and pass@10 separately.
   - For large teams, use the `--parallel` flag to distribute evaluations across multiple machines.
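The alerting step above (flagging a drop of more than 5 in pass@1) can be sketched in a few lines of Python. This is only an illustration: the function name `find_regressions`, the dictionary layout, and the "current" scores are assumptions made for the example, not part of the toolkit, and the threshold is interpreted here as percentage points.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    threshold_pts: float = 5.0,
) -> List[str]:
    """Return the metrics whose score fell by more than threshold_pts.

    Scores are percentages (e.g. 42.3 for 42.3%), so the drop is
    measured in percentage points. Hypothetical helper, not toolkit API.
    """
    regressions = []
    for metric, base_value in baseline.items():
        if metric not in current:
            continue  # metric was not measured in the current run
        drop = base_value - current[metric]
        if drop > threshold_pts:
            regressions.append(metric)
    return regressions


# Made-up scores for illustration only.
baseline_scores = {"pass@1": 42.3, "pass@5": 68.7}
current_scores = {"pass@1": 36.0, "pass@5": 66.0}

print(find_regressions(baseline_scores, current_scores))  # ['pass@1']
```

A scheduled job could call a check like this after each evaluation run and send a Slack or email notification only when the returned list is non-empty.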

Setup & Installation

Quick Install

No install command available. Check the GitHub repository for manual installation instructions.

Alternative Install (Git Clone)

git clone https://github.com/jimmc414/claudecode_gemini_and_codex_swebench

Requirements

Claude Code or compatible AI agent
Works with: Claude

Quick Start Guide

Install the Skill

Copy the install command above and run it in your terminal.

Open Your AI Agent

Launch Claude Code, Cursor, or your preferred AI coding agent.

Try It Out

Use the prompt template or examples below to test the skill.

Customize

Adapt the skill to your specific use case and workflow.

Usage Examples

Prompt Template

Evaluate the performance of [MODEL_NAME] on the SWEbench-lite dataset using the claudecode_gemini_and_codex_swebench toolkit. Compare its results against the baseline metrics for [BASELINE_MODEL] on [SPECIFIC_TASK_CATEGORY] (e.g., bug fixes, feature additions, or regression tests). Provide a breakdown of pass@k rates for k=1,5,10, and highlight any significant deviations from the baseline. Include a summary of the most common failure patterns observed in the evaluation.
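The prompt asks for pass@k at k=1, 5, and 10. The listing does not say how the toolkit computes these rates, so the following is a sketch of the unbiased estimator commonly used for pass@k, introduced with the original Codex evaluation: with `n` attempts per task of which `c` pass, pass@k = 1 - C(n-c, k) / C(n, k). The function name and sample numbers are illustrative assumptions.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task from n attempts with c passing.

    Computes 1 - C(n - c, k) / C(n, k): the probability that at least
    one of k attempts drawn without replacement from the n is correct.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        return 1.0  # too few failures to fill k draws, so one must pass
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative numbers: 10 attempts on one task, 3 of them passing.
for k in (1, 5, 10):
    print(k, round(pass_at_k(n=10, c=3, k=k), 4))
```

The dataset-level figure is the mean of this per-task value over all evaluated tasks.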

Example Output

### Performance Evaluation Report: Claude Code vs. Baseline Models

**Dataset:** SWEbench-lite (100 Python repositories, 2,000+ tasks)
**Evaluation Date:** November 15, 2023
**Models Compared:**
- **Claude Code (Latest)**
- **Baseline Model:** Codex (v1.2)
- **Task Category:** Bug Fixes (60% of dataset)

#### Key Metrics Summary
| Metric               | Claude Code | Codex (Baseline) | Delta    |
|----------------------|-------------|------------------|----------|
| **pass@1**           | 42.3%       | 38.1%            | +4.2 pp  |
| **pass@5**           | 68.7%       | 64.5%            | +4.2 pp  |
| **pass@10**          | 79.2%       | 75.8%            | +3.4 pp  |
| **Average Runtime**  | 4.2 min     | 5.8 min          | -27.6%   |
#### Detailed Analysis
Claude Code demonstrated superior performance in 7 out of 10 evaluated repositories, particularly excelling in tasks involving complex dependency resolution (e.g., fixing `pandas` merge operations with custom aggregations). The model achieved a **pass@1 rate of 42.3%**, 4.2 percentage points above Codex, with the most significant gains observed in tasks requiring multi-file modifications (e.g., fixing `django` view hierarchies).

**Failure Pattern Analysis:**
1. **Edge Cases:** Both models struggled with tasks involving rare `numpy` array configurations (e.g., broadcasting edge cases), though Claude Code resolved 12% more of these than Codex.
2. **Dependency Conflicts:** 30% of failures in both models stemmed from unresolved package conflicts, suggesting a need for better dependency management tools in the evaluation pipeline.
3. **Test Suite Limitations:** 15% of tasks failed due to incomplete test coverage in the repository, highlighting a limitation of the SWEbench-lite dataset rather than model performance.

**Recommendations:**
- **For Operations Teams:** Use this toolkit to set up automated weekly evaluations comparing new model versions against your baseline. Focus on pass@5 metrics to balance speed and reliability.
- **For Model Developers:** Investigate the 4.2-percentage-point improvement in pass@1 for Claude Code, particularly in dependency-heavy tasks, to identify optimization opportunities.
- **For Dataset Curators:** Consider expanding test coverage in SWEbench-lite to reduce false negatives from incomplete test suites.

**Next Steps:**
- Schedule a follow-up evaluation after the next model update to track progress.
- Share this report with the engineering team to prioritize fixes for the most common failure patterns.
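The Delta column in the sample report mixes two kinds of difference: the pass@k rows are absolute differences in percentage points, while the runtime row is a relative change. A minimal sketch that reproduces both from the sample figures (the helper names are hypothetical, chosen for this example):

```python
def point_delta(model: float, baseline: float) -> float:
    """Absolute difference between two percentages, in percentage points."""
    return round(model - baseline, 1)


def relative_change(model: float, baseline: float) -> float:
    """Change relative to the baseline value, expressed in percent."""
    return round((model - baseline) / baseline * 100, 1)


# Figures taken from the sample report's Key Metrics Summary.
print(point_delta(42.3, 38.1))      # pass@1 delta in points: 4.2
print(point_delta(79.2, 75.8))      # pass@10 delta in points: 3.4
print(relative_change(4.2, 5.8))    # runtime change: -27.6
print(relative_change(42.3, 38.1))  # pass@1 relative gain: 11.0
```

Stating which of the two is meant avoids ambiguity: a 4.2-point gain on a 38.1% baseline is a relative improvement of about 11%.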

Apply to these tools

Gemini

Google's multimodal AI model and assistant

Claude

AI assistant built for thoughtful, nuanced conversation

Baseline

Interactive proposals that win more deals

Microsoft Teams

Get more done every day with Microsoft Teams – powered by AI

Respell

Agentic AI Workflow platform

Notion

Connected workspace for docs, wikis, and projects

Compatible MCP servers

context sync

mcp notion server

src to kb

notion mcp

slime

notion
