AI Models & Cost Guide

Choosing models, understanding costs, and optimizing token usage for test generation.

Overview

SPECTRA uses two AI models per run: a generator (behavior analysis + test creation) and a critic (grounding verification). Choosing the right combination determines quality, speed, and cost. This guide is based on real production data — 1,261 test cases generated across 7 suites for $0.00.

GitHub Copilot Model Reference

All models are accessed through the github-models provider via the Copilot SDK. Premium request (PR) multipliers determine how much each call costs against your monthly allowance.

Included Models (0× multiplier — unlimited)

Model	Config value	Best for
GPT-4.1	`gpt-4.1`	Generation, critic, general purpose
GPT-4o	`gpt-4o`	Legacy (being deprecated)
GPT-5 mini	`gpt-5-mini`	Fast critic, light tasks

These models consume zero premium requests on any paid Copilot plan.

Premium Models (consume PRs from monthly allowance)

Model	Config value	Multiplier	Pro+ (1,500 PRs)
Claude Sonnet 4.5	`claude-sonnet-4.5`	1×	1,500 calls
Claude Sonnet 4.6	`claude-sonnet-4.6`	1×	1,500 calls
Claude Haiku 4.5	`claude-haiku-4.5`	0.33×	~4,500 calls
GPT-5	`gpt-5`	1×	1,500 calls
Claude Opus 4.5	`claude-opus-4.5`	3×	500 calls
Claude Opus 4.6	`claude-opus-4.6`	3×	500 calls

Copilot Plans

Plan	Price	Monthly PRs	Included models
Copilot Pro	$10/mo	300	GPT-4.1, GPT-4o, GPT-5 mini
Copilot Pro+	$39/mo	1,500	GPT-4.1, GPT-4o, GPT-5 mini
Overage	$0.04/PR	Unlimited	On-demand after allowance

Student Plan: Since March 12, 2026, Claude Sonnet, Claude Opus, and GPT-5.4 are removed from self-selection on the Student plan. Only Auto mode provides access to Anthropic models. Upgrade to Pro or Pro+ for direct Sonnet access.

Recommended Presets

The critic should always be a different model from the generator for independent hallucination detection (see Grounding Verification).

Preset 1: Best Quality (Recommended)

Sonnet generator + GPT-4.1 critic. Deep behavior analysis, cross-family verification, zero critic cost.

{
  "ai": {
    "providers": [
      { "name": "github-models", "model": "claude-sonnet-4.5", "enabled": true }
    ],
    "critic": {
      "enabled": true,
      "provider": "github-models",
      "model": "gpt-4.1"
    }
  }
}

Cost per --count 20 run: ~4 PRs (analysis + generation batches). Critic is free. Real-world result: 1,261 tests across 7 suites = 75 PRs total.

Preset 2: Zero Cost

GPT-4.1 generator + GPT-5 mini critic. Both unlimited. Good for 80% of use cases but shallower behavior analysis (~40 behaviors vs ~200 with Sonnet).

{
  "ai": {
    "providers": [
      { "name": "github-models", "model": "gpt-4.1", "enabled": true }
    ],
    "critic": {
      "enabled": true,
      "provider": "github-models",
      "model": "gpt-5-mini"
    }
  }
}

Cost: $0 always. Unlimited tests/month.

Preset 3: Budget Cross-Family

GPT-4.1 generator + Haiku critic. Free generation with cross-family verification at 0.33× per critic call.

{
  "ai": {
    "providers": [
      { "name": "github-models", "model": "gpt-4.1", "enabled": true }
    ],
    "critic": {
      "enabled": true,
      "provider": "github-models",
      "model": "claude-haiku-4.5"
    }
  }
}

Cost per --count 20 run: ~7 PRs (critic only). Generation is free.

Real Production Run Data

Actual results from a full production run. Generator: Claude Sonnet 4.5. Critic: GPT-4.1 (parallel, max_concurrent: 5). Both via github-models provider on Copilot Pro+. Some suites were regenerated multiple times during testing.

Run Results

Suite	Tests Generated	Gen Time	Critic Time	Total	PRs Used
Standard Calculator	238	22m26s	23m02s	46m19s	13
Unit Converter	181	18m34s	17m58s	37m20s	11
Date Calculation	398 (2 runs)	36m07s	43m08s	47m49s	23
General App Features	100	12m49s	10m22s	23m37s	7
Scientific Calculator	135	11m31s	13m19s	18m15s	8
Programmer Calculator	117	12m20s	14m57s	16m02s	7
Graphing Calculator	92	11m06s	10m08s	13m44s	6
Total	1,261	~2h05m	~2h13m	~3h23m	~75

Token Consumption

Suite	Input Tokens	Output Tokens	Total
Standard Calculator	5,898,939	184,274	6,083,213
Unit Converter	4,157,801	164,090	4,321,891
Date Calculation	9,191,320	341,179	9,532,499
General App Features	2,447,233	101,976	2,549,209
Scientific Calculator	3,319,320	101,543	3,420,863
Programmer Calculator	2,819,811	111,662	2,931,473
Graphing Calculator	2,376,626	85,621	2,462,247
Total	30,211,050	1,090,345	31,301,395

Per-Phase Timing

Phase	Avg per call	Notes
Analysis (Sonnet)	25–148s	Varies by doc complexity. Sonnet finds 200+ behaviors; GPT-4.1 finds ~40
Generation batch (Sonnet, 20 tests)	~110s	~5.5s per test
Critic call (GPT-4.1, parallel ×5)	~6s per call, ~1.2s effective	5 concurrent calls reduces wall time by ~80%

Cost Comparison

Full workload: 1,261 tests across 7 suites

Provider	Input Cost	Output Cost	Total
Copilot Pro+ (github-models)	included	included	$0.00 (~75 of 1,500 PRs)
Copilot Pro overage ($0.04/PR)	—	—	$3.00
Azure AI Foundry (Sonnet 4.5)	$90.63	$16.36	$106.99
Anthropic API direct	$90.63	$16.36	$106.99

Full monthly capacity at Pro+ (1,500 PRs)

Provider	Monthly Cost
Copilot Pro+	$39 (subscription)
Azure AI Foundry equivalent	~$2,169
Copilot overage equivalent	$60 (1,500 × $0.04)

Premium Request Budget

After generating 1,261 tests across all 7 suites (within a single billing cycle):

Metric	Value
PRs consumed (total account)	191.52 of 1,500
PRs from SPECTRA runs	~75 (Sonnet generation + analysis only)
PRs from VS Code / other usage	~116
PRs remaining	1,308 (19 days left in cycle)
Billed amount	$0.00

The 55× price difference between Copilot Pro+ and Azure pay-per-token exists because Copilot is a subscription model — Microsoft subsidizes heavy users with revenue from lighter users. SPECTRA’s workload (hundreds of structured API calls with large system prompts) is unusually token-intensive for a consumer subscription.

Batch Size & Timeout Tuning

Different models require different batch sizes and timeouts. Match your config to the model’s speed characteristics.

Model	Recommended batch_size	analysis_timeout	generation_timeout
GPT-4.1	20–30	3 min	5 min
Claude Sonnet 4.5	20	3 min	5 min
DeepSeek-V3.2	8	10 min	20 min
GPT-4o-mini	20–30	2 min	3 min

{
  "ai": {
    "analysis_timeout_minutes": 3,
    "generation_timeout_minutes": 5,
    "generation_batch_size": 20
  }
}

Quality Comparison: Sonnet vs GPT-4.1

Based on the same documentation (Standard Calculator suite):

Metric	Claude Sonnet 4.5	GPT-4.1
Behaviors discovered	~200–238	~39–40
Analysis depth	Deep edge cases, implicit rules	Surface-level, explicit rules
BVA exact boundaries	Specific values	Sometimes generic
Decision table combinations	4+ conditions	2–3 conditions
State transition chains	5+ states	2–3 states
Step specificity	Concrete actions, exact data	More generic phrasing
Expected result detail	Specific error messages	General outcomes

For simple CRUD documentation the difference is minimal. For complex business logic with implicit rules, Sonnet produces significantly more thorough coverage.

Debug Log & Monitoring

Enable debug logging to track token usage and timing per call:

{
  "debug": {
    "enabled": true,
    "mode": "append"
  }
}

Each AI call is logged with model, provider, tokens, and elapsed time:

[generate] BATCH OK requested=20 elapsed=113.9s model=claude-sonnet-4.5 provider=github-models tokens_in=174233 tokens_out=7618
[critic  ] CRITIC OK test_id=TC-100 verdict=Partial score=0.80 elapsed=8.9s model=gpt-4.1 provider=github-models tokens_in=13056 tokens_out=429

Every run ends with a summary line:

[summary ] RUN TOTAL command=generate suite=standard calculator calls=250 tokens_in=5898939 tokens_out=184274 elapsed=46m19s phases=generation:12/22m26s,critic:238/23m02s

Use --verbosity diagnostic to force-enable debug for a single run without changing the config.

Overage Budget Setup

If you exhaust your monthly PRs and want to continue with premium models, enable overage billing in GitHub Settings:

Go to GitHub Settings → Billing and licensing → Budgets and alerts
Set a budget for premium request overages (e.g., $10/month)
Additional PRs are billed at $0.04 each

Accounts created before August 22, 2025 have a default $0 budget — overages are blocked unless you explicitly set a budget. Without a budget, you fall back to included models (GPT-4.1, GPT-4o, GPT-5 mini) when your allowance runs out.

Migration from Azure / BYOK

If you’re moving from Azure-hosted models to GitHub Models:

Before (Azure OpenAI / Azure Anthropic):

{
  "ai": {
    "providers": [
      {
        "name": "azure-openai",
        "model": "DeepSeek-V3.2",
        "api_key_env": "AZURE_API_KEY",
        "base_url": "https://your-endpoint.azure.com/"
      }
    ],
    "analysis_timeout_minutes": 10,
    "generation_timeout_minutes": 20,
    "generation_batch_size": 8
  }
}

After (GitHub Models via Copilot Pro+):

{
  "ai": {
    "providers": [
      { "name": "github-models", "model": "claude-sonnet-4.5", "enabled": true }
    ],
    "analysis_timeout_minutes": 3,
    "generation_timeout_minutes": 5,
    "generation_batch_size": 20,
    "critic": {
      "enabled": true,
      "provider": "github-models",
      "model": "gpt-4.1"
    }
  }
}

Key changes: remove api_key_env and base_url (GitHub Models uses gh auth token), reduce timeouts (faster models), increase batch size (no timeout risk), switch critic to a different model family.

Authenticate with gh auth login and verify with spectra auth.