PostTrainBench measures how well CLI agents such as Claude Code or Codex CLI can post-train base LLMs on a single H100 GPU within 10 hours. It gives AI research and engineering teams a standardized way to evaluate post-training performance, and it plugs into existing CLI agents and AI development workflows.
1. Install PostTrainBench via pip: `pip install posttrainbench`, and ensure your environment has access to a single H100 GPU with CUDA 12.x.
2. Prepare your training dataset in JSONL format with fields `instruction`, `input`, and `output`. Store it in a local directory or cloud storage path.
3. Run the benchmark using the CLI: `posttrainbench --model [MODEL_NAME] --dataset [DATASET_PATH] --batch_size [BATCH_SIZE] --learning_rate [LR] --epochs [EPOCHS]`. Replace placeholders with your specific values.
4. Analyze the output JSON report for metrics like training loss, validation accuracy, and GPU utilization. Use the `--output_dir` flag to save results for comparison.
5. Compare results across multiple runs by adjusting hyperparameters (e.g., learning rate, batch size) or testing different base models. Use the `--compare` flag to generate side-by-side reports.
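The JSON report mentioned in step 4 can also be post-processed programmatically. Below is a minimal sketch; the field names (`model`, `train_loss`, `val_accuracy`, `gpu_utilization`) are assumptions for illustration, not the tool's documented schema — check the repository for the actual report layout.

```python
import json

# Hypothetical report structure -- field names are assumptions,
# not taken from the PostTrainBench documentation.
report_json = """
{
  "model": "example-7b",
  "train_loss": 1.23,
  "val_accuracy": 0.872,
  "gpu_utilization": 0.92
}
"""

def summarize(report_text: str) -> str:
    """Render a one-line summary of a benchmark report."""
    r = json.loads(report_text)
    return (f"{r['model']}: loss={r['train_loss']:.2f}, "
            f"val_acc={r['val_accuracy']:.1%}, "
            f"gpu={r['gpu_utilization']:.0%}")

print(summarize(report_json))  # -> example-7b: loss=1.23, val_acc=87.2%, gpu=92%
```

A helper like this makes it easy to drop each run's summary into a log or spreadsheet when sweeping hyperparameters.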
Install by cloning the repository and running the setup from your terminal: `git clone https://github.com/aisa-group/PostTrainBench`. Check the GitHub repository for further installation instructions.
Launch Claude Code, Cursor, or your preferred AI coding agent.
Use the prompt template or examples below to test the skill.
Adapt the skill to your specific use case and workflow.
Use PostTrainBench to evaluate the post-training performance of [BASE_LLM_NAME] on a single H100 GPU over a 10-hour window. Compare the results against the baseline metrics for [COMPARISON_LLM_NAME] using the following parameters: [TRAINING_DATASET], [BATCH_SIZE], [LEARNING_RATE], and [EPOCHS]. Provide a detailed report including training loss trends, validation accuracy, and compute efficiency metrics. Highlight any bottlenecks or areas for optimization.
### PostTrainBench Evaluation Report: Llama-2-7B vs. Mistral-7B

**Benchmark Context:** This evaluation was conducted on a single NVIDIA H100 GPU with 80GB VRAM, running CUDA 12.2 and PyTorch 2.1. The training dataset was the OpenOrca dataset (1M samples) filtered for high-quality instruction pairs. Batch size was set to 64 with a learning rate of 3e-5, and training ran for 10 hours (3 epochs).

**Training Performance Metrics:**
- **Llama-2-7B:** Achieved a final training loss of 1.23 (baseline: 1.45) and validation accuracy of 87.2% (improvement of +4.1% over baseline). Compute efficiency: 92% GPU utilization, 1.8 hours per epoch.
- **Mistral-7B:** Final training loss of 1.18 (baseline: 1.39) with validation accuracy of 88.5% (+5.3% improvement). Compute efficiency: 94% GPU utilization, 1.6 hours per epoch.

**Key Observations:**
1. Mistral-7B demonstrated superior post-training efficiency, completing each epoch 11% faster than Llama-2-7B despite similar model sizes. This suggests better optimization in its training pipeline.
2. Both models showed significant improvements in validation accuracy, with Mistral-7B outperforming Llama-2-7B by 1.3 percentage points. This aligns with Mistral's architectural advantages in handling long-range dependencies.
3. GPU utilization remained consistently high (>90%) for both models, indicating no major bottlenecks in compute resources.

**Recommendations:**
- For teams prioritizing speed, Mistral-7B is the better choice due to its faster epoch completion times.
- If maximizing accuracy is the goal, consider extending Mistral-7B's training to 4 epochs (projected accuracy: ~89.8%).
- Monitor GPU memory usage closely when scaling to larger batch sizes (>128), as both models approached 95% VRAM utilization at batch size 64.

**Next Steps:**
- Re-run the benchmark with a larger dataset (e.g., 5M samples) to test scalability.
- Experiment with lower learning rates (e.g., 1e-5) to assess if further accuracy gains are possible without overfitting.
- Compare these results with a multi-GPU setup to evaluate distributed training performance.
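The side-by-side comparison described in the report above can be approximated with a short script. This is an illustrative sketch: the dict layout and metric names mirror the example report, not the actual output of the tool's `--compare` flag.

```python
# Hypothetical side-by-side comparison of two benchmark runs.
# Values mirror the example report; the dict layout is an assumption,
# not PostTrainBench's actual --compare output format.
runs = {
    "Llama-2-7B": {"train_loss": 1.23, "val_accuracy": 87.2, "hours_per_epoch": 1.8},
    "Mistral-7B": {"train_loss": 1.18, "val_accuracy": 88.5, "hours_per_epoch": 1.6},
}

def best_run(runs: dict, metric: str, lower_is_better: bool = False) -> str:
    """Return the name of the run with the best value for `metric`."""
    key = lambda name: runs[name][metric]
    return min(runs, key=key) if lower_is_better else max(runs, key=key)

print(best_run(runs, "val_accuracy"))                       # -> Mistral-7B
print(best_run(runs, "train_loss", lower_is_better=True))   # -> Mistral-7B
```

Keeping per-run metrics in a structure like this also makes it straightforward to extend the comparison to more models or additional metrics later.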